Abstract


This report presents the results of the Dynamic Image Interpretation for the Autonomous
Vehicle Navigation project for the period 2/26/85 to 7/12/89. The purpose of the project
is to develop algorithms and tools to enable a robotic ground vehicle to navigate
autonomously through realistic landscapes.

In  this  final  annual  report,  we  summarize  our  accomplishments  in  constructing  robust 
algorithms  to  be  used  for  vehicle  navigation  as  well  as  tools  that  have  been  developed  to 
more  efficiently  utilize  these  algorithms. 
Summary


Over the course of this contract from February 26, 1985 to July 12, 1989, our research
has fallen into two broad categories: Motion and Mobile Robot Navigation. We first
summarize our work on motion and then that on mobile robot navigation.

Work  on  Motion 

Our research on motion has led us to develop a variety of motion algorithms and, in
most cases, to apply them to real-world image sequences, including the domains of robot
arm workspaces, indoor hallways, and outdoor sidewalk/road scenes.

Anandan constructed an algorithm for determining feature point correspondences between
frames that allowed the computation of dense displacement fields with associated
confidences. The algorithm can also be used to effectively track points during motion.
Glazer developed two algorithms for the efficient computation of image motions using
hierarchical multiresolution methods operating over image data pyramids. Adiv developed
an algorithm (to date, the only one that exists) for general sensor motion (five degrees
of freedom) in an environment with objects undergoing independent general motion. He
also analyzed the conditions under which the determination of these motion parameters
would be ambiguous. In related work, Snyder analyzed the effects of uncertainty in the
location of the FOE and of feature points in the image on the computation of depth, and
showed how this analysis could be used to provide quantitative predictions for constraining
the search window used for matching these points in future frames. He also analyzed the
relative efficacy of motion and stereo for depth computations.

Much of our work has centered on the recovery of depth from assumed translational motion.
Pavlin developed an efficient algorithm for extracting the focus-of-expansion (FOE)
from a sensor undergoing pure translational motion (i.e., two degrees of freedom) to an
accuracy of a few degrees. Bharwani, et al. used Pavlin’s work to develop a multi-frame
algorithm for depth extraction under known translational motion which iteratively
predicted the image motion of a feature point in future frames, determined correspondence by
a search over the limited predicted area, and then refined the depth estimate using the new
match. Difficulties with this algorithm led us to develop a general motion algorithm by
combining the optical flow computation of Anandan and the motion parameter estimation
component of Adiv’s algorithm. This algorithm seems to be able to predict depth with an
error of about 10%.

Other techniques we developed to extract depth from motion are those due to
Balasubramanyam, Snyder, and Weiss using stereoscopic motion, Pavlin using assumptions of
constant general motion, Williams and Hanson using grouped geometric structures, and
Sawhney and Oliensis using the image traces of points undergoing purely rotational
motion. A further aspect of our research has been the collection of extensive motion data
with ground truth of known precision. These data were collected on the Autonomous Land
Vehicle (ALV) at Martin Marietta’s Denver, Colorado test site, and are presently available
to the general vision community.

Work  on  Mobile  Robot  Navigation 

In  the  past,  mobile  robots  have  been  constrained  to  operate  in  either  an  indoor  or  an 
outdoor  environment,  but  not  both.  Special  purpose  representations  and  ad  hoc  sensor 
techniques geared toward tasks of narrow focus have dominated these efforts. Our mobile
robot effort has addressed the problem of enabling a mobile automaton to navigate
intelligently through indoor and outdoor environments.

Our first attempt to construct such a “cosmopolitan” robot was the development of the
Autonomous Robot Architecture (AuRA) by Arkin, which makes use of a “meadow map”
for global path planning. This map serves as the robot’s long term memory and contains
embedded a priori knowledge to guide sensor expectations.

Arkin’s  work  has  been  used  by  Fennema  to  further  investigate  the  problem  of  navigating 
intelligently  through  arbitrary  environments.  He  uses  model-based  processing  of  the  visual 
sensory  data  as  the  primary  mechanism  for  obstacle  avoidance,  movement  through  the 
environment,  and  measuring  progress  towards  a  given  goal.  The  modular  building  blocks 
of  the  system  include  the  planning  and  plan  monitoring  modules,  a  set  of  vision  modules, 
a  3-D  modelling  system,  a  2-D  feature  matching  and  fitting  system,  and  finally  a  3-D 
pose  refinement  system  for  updating  the  robot’s  location  and  orientation. 

The  world  model  is  developed  in  a  3-D  solid  modelling  package,  GeoMeter,  developed 
by Connolly, Weiss, et al. GeoMeter represents polyhedral solid objects (such as
buildings) in terms of basic geometrical entities such as vertices, faces, and edges, and
it also represents curved surfaces. It has been used to construct 3-D models of
both indoor and outdoor environments.

An  important  problem  in  model-driven  3-D  interpretation  is  how  to  use  approximate 
knowledge of the location and orientation of the sensor, models of objects in the
environment, and the results of low-level vision to determine the image-to-model
correspondence. The approach we have taken is to separate 2-D model-to-image matching from
the determination of the 3-D pose parameters. Mechanisms for optimal 2-D model matching,
used  to  locate  landmarks  derived  from  the  world  model  and  to  estimate  the  robot’s  current 
position,  are  the  subject  of  research  by  Beveridge,  et  al.,  who  determine  correspondences 
between  the  model  and  the  data  lines  such  that  an  optimized  spatial  fit  will  produce  the 
lowest  match  error.  Methods  for  determining  the  “pose,”  i.e.,  the  position  and  orientation, 
of  the  robot  with  respect  to  a  world  coordinate  system  have  been  developed  by  Kumar. 

The  successes  in  actual  robot  experimentation  to  date  at  UMass  have  been  modest, 
but  are  increasing  in  power  and  robustness,  and  are  beginning  to  have  real  significance. 
Successful  navigation  of  both  an  outdoor  sidewalk  and  an  indoor  hall  using  the  approaches 
of Fennema, Beveridge, Kumar, et al. has been achieved. The algorithm is quite robust
when working in (unchanging) environments in the presence of significant path edge
discontinuities (doorways, vehicle tracks, clutter, etc.). To date, obstacle avoidance on
vehicle runs
has  been  handled  using  ultrasonic  data.  Dead-reckoning  information  is  used  minimally  in 
our  system  as  our  goal  is  to  serve  as  a  testbed  for  vision  algorithms. 

Many of the issues involved in the mobile vehicle research can be seen as complementary
to those of other areas in our vision and robotics groups. The use of perceptual and
motor schemas in the proposed vehicle architecture exploits many of the concepts used in
both the VISIONS scene interpretation group and the work being done in the Laboratory
for Perceptual Robotics’ distributed programming environment. Multi-sensor integration,
certainly crucial for the vehicle’s domain, will benefit from the work being done on the
integration of vision, touch, and force sensing. Our research on developing parallel
implementations of robust vision algorithms is certainly synergistic with our development of
parallel architectures for real time vision processing.


1  Introduction 

One of the key features of an object that usually distinguishes it from other objects in
the environment is its movement relative to them. Even when an object is camouflaged
by its similarity in appearance to other objects, any independent movement of the object
immediately gives it away. In addition, if there is relative movement between the camera
and the object, the viewer is automatically provided with several distinct views of the
object and therefore with 3D structures and their dynamic characteristics.

The  two  most  common  methods  of  obtaining  two  images  from  two  distinct  views  are 
stereopsis  and  motion.  Stereopsis  is  when  two  images  are  obtained  simultaneously  by 
two cameras. Motion is when several images are taken one after another by a single camera
moving  with  respect  to  the  environment.  In  most  applications  of  stereopsis,  it  is  common 
to  orient  the  cameras  such  that  their  image  planes  are  perpendicular  to  the  ground  plane 
and  their  optical  axes  are  parallel  to  each  other.  Usually  the  displacement  between  the 
camera  locations  is  horizontal  and  parallel  to  the  image  plane. 

Given  two  images  obtained  from  either  stereo  or  motion,  the  task  is  to  combine  them 
to provide 3D information about the objects in the image. The process usually consists of
two stages: the establishment of the correspondence between the points in the two images
to  provide  a  disparity  and  then  a  depth  map,  followed  by  some  process  that  uses  the  depth 
information  to  discover  and  describe  the  surfaces  in  the  3D  environment. 

Before  we  proceed  further,  we  define  a  few  key  terms.  The  correspondence  problem  is 
the task of identifying events in the two images as images of the same event in the
environment. The disparity is the distance between the locations in the two images of the
two  corresponding  events.  When  the  optical  axes  are  parallel  to  each  other,  the  depth  of 
a  point  is  its  distance  along  the  optical  axis  from  the  image  planes. 
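
As a rough illustration of how disparity relates to depth in the parallel-axis configuration described above, the sketch below uses the textbook pinhole-camera relation Z = f * B / d. This relation is not stated in the report, and the names `focal_length_px`, `baseline_m`, and `disparity_px` are hypothetical; treat it as a sketch under those assumptions.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth along the optical axis for two parallel-axis pinhole cameras.

    Assumes the standard relation Z = f * B / d, where f is the focal length
    in pixels, B the horizontal baseline, and d the disparity in pixels.
    (Assumed model; the report does not give this formula.)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Halving the disparity doubles the estimated depth.
    print(depth_from_disparity(10.0, 500.0, 0.2))
    print(depth_from_disparity(5.0, 500.0, 0.2))
```

Because depth varies inversely with disparity, a fixed error in the measured disparity produces a larger depth error for distant points than for near ones.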

Motion  processing  can  be  broadly  divided  into  two  categories: 

1.  the  camera  moves  and  the  environment  is  stationary,  and 

2.  there  are  independently  moving  objects  in  the  scene. 

The  first  case  is  easier  to  analyze  and  process,  as  can  be  seen  from  the  large  number  of 
techniques  that  have  been  developed  for  this  purpose. 

The most common approach taken towards motion analysis is one in which the processing
proceeds bottom-up. The movement of individual points in the images is computed
first, followed by a process that determines the motion of the camera, as well as the
location, 3D structure, and motion of the objects in the scene.

One  important  term  used  in  motion  research  is  optical  flow.  Optical  flow  can  be 
broadly  defined  as  the  vector  field  representing  the  changes  in  the  positions  of  the  images 
of  environmental  points  over  time.  Strictly  speaking,  it  is  necessary  to  distinguish  between 
the optical flow, which is the field of instantaneous 2D velocity vectors of the points in
the image on the image plane, and the displacement field, which is the field of discrete
displacement vectors connecting the location of the same image-point in successive image
frames. However, when the time interval between the frames is small enough, the
displacement field is a good approximation to the optical flow. The usual approach to motion
analysis consists of two steps: the computation of optical flow, followed by its
interpretation to provide the 3D structure and motion of the objects in the scene as well as
the motion of the camera. The computation of optical flow is similar to the correspondence
problem mentioned earlier. In fact, it is common to regard the correspondence problem in
stereopsis as a special case of motion correspondence. However, in stereopsis, the
knowledge of the relative locations of the cameras constrains the search for corresponding
points
in  a  manner  that  is  not  possible  in  motion  analysis.  Finally,  we  mention  one  important 
limitation  of  current  approaches  to  motion  analysis.  Most  of  the  techniques  for  motion 
analysis  deal  with  only  two  frames.  Some  initial  approaches  to  multi-frame  analysis  are 
described  in  the  body  of  this  report. 

Identifying image “events” that correspond to each other is the primary task of both
motion and stereo analysis. The term “events” is used here in a broad sense, to mean any
identifiable structure in the image, e.g., image intensities in a neighborhood, edges, lines,
texture markings, etc.

The techniques that rely on the similarity of the light intensity reflected from a scene
location in the two frames as the basis for determining correspondence are called
intensity-based approaches. Methods that identify stable image structures, and use them as
tokens for finding correspondences, are referred to as token-based approaches.

The  most  popular  way  of  solving  the  correspondence  problem  is  to  divide  it  into  one  or 
two  parts.  The  first  is  the  local  correspondence  problem,  which  provides  partial  or  complete 
constraints  on  the  displacement  of  a  point  in  the  image,  based  on  image  information  in 
the  immediate  neighborhood  of  that  point.  Usually  the  local  correspondence  is  solved 
(partially  or  fully)  independently  at  all  points  of  interest  in  the  image.  The  second  part, 
where used, consists in applying a non-local constraint on the flow field. This is usually
an assumption of the spatial smoothness of the flow field, or one that is derived from the
geometry of rigid bodies in motion. This constraint can be either global or semiglobal,
depending on whether or not explicit boundaries are recognized, across which the constraint
is not allowed to propagate.

It is also possible to impose, on top of this framework for the computation of displacement
fields, a multi-frequency, multi-resolution approach. In this approach the images are
pre-processed with a set of band-pass filters which are spatially local and which decompose
the spatial frequency spectrum in the image in a convenient way. The outputs from the
corresponding filters applied to the two images are matched, and the matching results from
the different filters at the same location in the image are combined using a consistency
constraint.

The primary goal of motion analysis is to determine the 3-dimensional structure of
the objects in the environment and the relative movement of the camera and the objects
in the scene. The determination of the 2-dimensional image displacements or velocities
of the image-points is only one (although an important one) of the steps involved. The
interpretation of the displacement (or velocity) fields to determine the 3D structure of the
environment and the relative 3D motion between the objects and the camera is another
important step.

In this report, we describe the techniques and algorithms we have developed for using
motion analysis to determine environmental structure and sensor motion. We also describe
how we have used these techniques in concert with other methods developed in our group
to address the problem of intelligently navigating an autonomous mobile robot through a
3D environment.


2  Motion  Research 

2.1  The  Reliable  Computation  of  Optical  Flow:  A  Smoothness 
Constraint  and  a  Confidence  Measure 

Although our hierarchical correlation algorithm [40] for the computation of dense
displacement fields proved to be an efficient and reliable technique, there are still a number
of situations where the algorithm makes mistakes. These situations arise in areas of the
image without significant intensity variations and at occlusion or motion boundaries. Our
previous work [5] attempted to identify such situations through the use of a confidence
measure which indicated the reliability of a match vector. The recent work of Anandan
uses a relaxation process to improve matches with low confidence based on neighbouring
matches with higher confidences.

In his recently completed doctoral dissertation [8], Anandan provides a unified framework
for extracting a dense displacement field from a pair of images, as well as an integrated
system based on a matching approach. This framework appears to be sufficiently general
to encompass both gradient-based and correlation-matching approaches. It consists of
a hierarchical scale-based matching scheme using bandpass filters, orientation-dependent
confidence measures, and a smoothness constraint for propagating reliable displacements.
His integrated system for the extraction of displacement fields uses the minimization of the
sum-of-squared-differences (SSD) as the local match-criterion, computes confidence
measures based on the shape of the SSD surface, and formulates the smoothness assumption
as the minimization of an error functional. This overcomes many of the difficult problems
that exist with other techniques.

The SSD measure which is to be minimized is expressed as

$$
\mathrm{SSD}(x_0, y_0; \delta x, \delta y) = \sum_{i,j=-n}^{n} W(i,j)\,\bigl[ I(x_0 + i,\, y_0 + j) - J(x_0 + i + \delta x,\, y_0 + j + \delta y) \bigr]^2 .
$$

Here $I$ and $J$ are the intensity functions describing the first and second images,
respectively, $W$ is a weighting function, $n$ is the radius of the match window, and
$\delta x$ and $\delta y$ are the x- and y-components, respectively, of the displacement of
the pixel located at $(x_0, y_0)$ in the first image. In practice, $W$ is taken to be a
Gaussian, and $n$ is chosen to be 2.
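
The SSD criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the Gaussian width `sigma`, the exhaustive integer search in `best_displacement`, and the `[row][column]` image layout are all assumptions made for the example.

```python
import math


def gaussian_weight(i, j, sigma=1.0):
    """Gaussian window weight W(i, j); sigma is an assumed value."""
    return math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))


def ssd(I, J, x0, y0, dx, dy, n=2, sigma=1.0):
    """Weighted sum of squared differences between the (2n+1) x (2n+1) window
    centred at (x0, y0) in image I and the window displaced by (dx, dy) in J.
    Images are lists of rows, indexed as image[y][x]."""
    total = 0.0
    for j in range(-n, n + 1):
        for i in range(-n, n + 1):
            diff = I[y0 + j][x0 + i] - J[y0 + j + dy][x0 + i + dx]
            total += gaussian_weight(i, j, sigma) * diff * diff
    return total


def best_displacement(I, J, x0, y0, search=2, n=2):
    """Integer displacement within +/- search pixels that minimizes the SSD
    (exhaustive search, used here only for illustration)."""
    candidates = [(dx, dy)
                  for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)]
    return min(candidates, key=lambda d: ssd(I, J, x0, y0, d[0], d[1], n))


if __name__ == "__main__":
    f = lambda x, y: x * x + 3 * y * y + x * y      # synthetic intensity pattern
    I = [[f(x, y) for x in range(16)] for y in range(16)]
    J = [[f(x - 1, y - 2) for x in range(16)] for y in range(16)]  # I shifted by (1, 2)
    print(best_displacement(I, J, 8, 8))
```

The values of `ssd` over all candidate displacements form the SSD surface whose shape is used for the confidence measures discussed in the text.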

The error functional consists of two terms: one, called the approximation error,
measures how well a given displacement field approximates the local match estimates; the
other, called the smoothness error, measures the global spatial variation of a given
displacement field. The finite-element method is used to solve the minimization problem. The
approach  also  gives  information  for  extracting  occlusion  boundaries  in  some  situations. 

The  confidence  measure  that  was  described  in  [5]  was  a  scalar  value  between  0  and  1 
that  indicated  the  reliability  of  the  displacement  vector  at  a  pixel  in  the  image.  One  such 
value  was  provided  for  each  pixel.  This  measure  was  derived  by  studying  the  properties 
of  the  error-surface  obtained  during  the  process  of  computing  the  displacement  at  a  pixel. 
However, the image displacement vector is a 2-D quantity. Hence, it is appropriate to
have  a  2-D  confidence  measure  associated  with  the  displacement  vector. 

In his previous work [5], Anandan observed that the error-surface allowed us to
distinguish between situations in which completely reliable information regarding the
displacement vector (i.e., at high curvature points along image contours) is available, those
in which we have only partial information (i.e., at edge locations where only the displacement
perpendicular to the edge can be reliably measured), and situations where there is no
reliable information (i.e., at homogeneous intensity areas of the image). The new confidence
measure  is  a  vector  quantity  which  uses  these  distinctions. 

The  work  of  Anandan  consists  of  two  steps.  The  first  is  the  computation  of  these 
vector-valued confidence measures and the second is the smoothing process which corrects
unreliable  displacement  vectors  based  on  their  reliable  neighbours. 

• The new confidence measure is best described as a two-dimensional vector. It is
convenient to describe the vector in terms of the local orthogonal basis vectors $e_{max}$
and $e_{min}$, which are the principal directions for the SSD surface. The displacement
vector $D$ can be decomposed in terms of its components along these basis vectors,
and confidence measures $c_{max}$ and $c_{min}$, given by the principal curvatures of the SSD
surface, are associated with these components. The details of their computation are
given in [6]. It is worthwhile to note that these are no longer bound to be between 0
and 1. The formulation of the smoothness constraint described below requires that
these values be allowed to vary between 0 and ∞.

• The process of improving an unreliable match estimate based on its neighbours is
formulated as a smoothness constraint on the displacement vector field. The smoothness
constraint consists of two errors, $E_{smooth}$ and $E_{approx}$, whose sum is minimized.
$E_{smooth}$ measures the spatial variation of the displacement field, i.e., the smoother
the variation, the smaller the error. It is taken to be:

$$
E_{smooth}(\{U\}) = \iint \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) dx\, dy ,
$$

where $\{U\}$ is the set of displacement vectors $U(x,y) = (u(x,y), v(x,y))^T$, derivatives
are represented by $u_x = \partial u / \partial x$, etc., and the integration is over the
whole image. $E_{approx}$ measures the deviation of the smooth displacement field from the
initial field provided by the matching process:

$$
E_{approx}(\{U\}) = \sum_{x,y} \left[ c_{max} \left( U \cdot e_{max} - D \cdot e_{max} \right)^2 + c_{min} \left( U \cdot e_{min} - D \cdot e_{min} \right)^2 \right] .
$$

The definition of this error makes it clear that the low confidence estimates are
allowed to vary more than the high confidence estimates. Hence, the smoothing process
modifies the initial displacement values at locations of low confidence measures more
than those at the locations of high confidence measures.
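
To make the roles of the two error terms concrete, the sketch below evaluates discrete stand-ins for them on a small grid. It is an illustration only: the integral is replaced by sums of squared forward differences (an assumption of this example, whereas the report solves the minimization with the finite-element method), and the function names and the list-of-rows data layout are hypothetical.

```python
def dot(a, b):
    """Dot product of two 2-D vectors given as (x, y) tuples."""
    return a[0] * b[0] + a[1] * b[1]


def smoothness_error(U):
    """Discrete stand-in for the smoothness error: the sum of squared forward
    differences of u and v. U[y][x] is the displacement (u, v) at pixel (x, y)."""
    rows, cols = len(U), len(U[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            u, v = U[y][x]
            if x + 1 < cols:                     # u_x and v_x
                total += (U[y][x + 1][0] - u) ** 2 + (U[y][x + 1][1] - v) ** 2
            if y + 1 < rows:                     # u_y and v_y
                total += (U[y + 1][x][0] - u) ** 2 + (U[y + 1][x][1] - v) ** 2
    return total


def approximation_error(U, D, e_max, e_min, c_max, c_min):
    """Confidence-weighted deviation of the field U from the initial match
    estimates D, measured along the principal directions at each pixel."""
    total = 0.0
    for y in range(len(U)):
        for x in range(len(U[0])):
            r_max = dot(U[y][x], e_max[y][x]) - dot(D[y][x], e_max[y][x])
            r_min = dot(U[y][x], e_min[y][x]) - dot(D[y][x], e_min[y][x])
            total += c_max[y][x] * r_max ** 2 + c_min[y][x] * r_min ** 2
    return total


if __name__ == "__main__":
    grid = lambda value: [[value, value], [value, value]]
    D = grid((1.0, 0.0))
    U = [[(1.0, 2.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]]
    # An edge-like situation: confident along x, no confidence along y.
    print(approximation_error(U, D, grid((1.0, 0.0)), grid((0.0, 1.0)),
                              grid(4.0), grid(0.0)))
    print(smoothness_error(U))
```

In the example, changing a displacement along the zero-confidence direction costs nothing in the approximation term, so only the smoothness term constrains it. This is the behaviour the text describes for low-confidence estimates.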


The  smoothness  constraint  translates  into  a  minimization  problem  which  is  solved  using 
the  finite-element  method,  since  this  permits  the  inclusion  of  known  discontinuities  in  the 
displacement  field.  The  application  of  this  method  leads  to  a  local  relaxation  algorithm, 
which  iteratively  updates  the  displacement  vector  field  [8]. 

Anandan has also shown that the functional minimization problem formulated in his
matching technique converges to the minimization problem used in gradient-based
techniques (e.g., Glazer’s technique discussed in the next section). In particular, by relating an
approximation  error  functional  used  in  his  matching  approach  to  the  intensity  constraints 
used  in  the  gradient-based  approaches,  he  explicitly  identifies  confidence  measures  which 
have  thus  far  been  implicitly  used  in  the  gradient-based  approach.  Finally,  he  suggests  the 
ways  that  algorithms  operating  on  a  pair  of  frames  can  be  developed  into  multiple-frame 
algorithms,  and  discusses  their  relationship  to  spatio-temporal  energy  models.  Anandan’s 
algorithm  has  been  applied  to  many  image  sequences.  In  Figure  1,  we  show  a  pair  of 
images,  in  which  both  the  camera  and  the  dinosaur  have  moved  independently  from  one 
image  to  the  next. 

In Figure 2, we show the corresponding optical flow determined by Anandan’s algorithm.


Figure 1: The Dinosaur-Image Experiment.

The input images (128 x 128), with Frame 1 at top, Frame 2 at bottom. The camera
motion is a translation to the right, along with a rotation about the vertical axis. The
independent motion of the dinosaur is primarily rotational about the vertical axis.

Figure 2: Displacement Field Using Anandan’s Algorithm.

The smoothed displacement vector field computed using Anandan’s algorithm for the
dinosaur-image, superimposed on Frame 1. In order to enhance visibility, only a 32 x 32
sample of the displacement field is shown.


2.2 Glazer’s Hierarchical Algorithms

Glazer’s recently completed thesis [41] presents an approach to motion detection using
multi-resolution methods in a hierarchical processing architecture. Two motion detection
algorithms are developed and analyzed. The hierarchical correlation algorithm utilizes
a coarse-to-fine control strategy across the resolution levels and overcomes two
disadvantages of single-level correlation: large search areas requiring expensive searches, and
repetitive image structures which cause incorrect matches. The hierarchical gradient-based
algorithm [42], generated over low-pass image pyramids, extends single-level gradient
algorithms to the computation of large displacements. Within each level, the next
refinement of the displacement field is obtained by combining a local intensity constraint
and a global smoothness constraint. The mathematical formulation involves the minimization
of an error functional consisting of two terms, corresponding to the intensity and the
smoothness constraints mentioned above. The minimization problem is solved using the
finite-difference approach, which leads to a multi-resolution relaxation algorithm. A formal
analysis of the hierarchical gradient algorithm is presented, including the basic equations
for computing a refined disparity vector, the discrete representations and computations
for solving these equations, and a geometric interpretation of the resulting relaxation
algorithm. The experimental results show that the two algorithms have comparable accuracy,
and a cost analysis shows that the hierarchical gradient algorithm is less costly.
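
The coarse-to-fine control strategy described above can be sketched as follows. This is a simplified stand-in rather than Glazer's algorithm: it assumes a pyramid built by 2 x 2 block averaging, an unweighted SSD match criterion, and a refinement search of plus or minus one pixel at each finer level. All function names are hypothetical.

```python
def ssd(I, J, x, y, dx, dy, n=2):
    """Unweighted SSD between the window at (x, y) in I and the displaced window in J."""
    total = 0.0
    for j in range(-n, n + 1):
        for i in range(-n, n + 1):
            diff = I[y + j][x + i] - J[y + j + dy][x + i + dx]
            total += diff * diff
    return total


def reduce_level(img):
    """Halve the resolution by averaging 2 x 2 blocks (a crude low-pass step)."""
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]


def build_pyramid(img, levels):
    """Return [full resolution, half, quarter, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid


def search(I, J, x, y, centre, radius):
    """Best integer displacement within `radius` of the predicted `centre`."""
    cx, cy = centre
    candidates = [(cx + dx, cy + dy)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return min(candidates, key=lambda d: ssd(I, J, x, y, d[0], d[1]))


def coarse_to_fine(I, J, x, y, levels=3, coarse_radius=2):
    """Estimate the displacement of pixel (x, y) from I to J, coarse to fine."""
    pyr_I, pyr_J = build_pyramid(I, levels), build_pyramid(J, levels)
    top = levels - 1
    d = search(pyr_I[top], pyr_J[top], x >> top, y >> top, (0, 0), coarse_radius)
    for level in range(top - 1, -1, -1):
        d = (2 * d[0], 2 * d[1])        # project the estimate to the finer level
        d = search(pyr_I[level], pyr_J[level], x >> level, y >> level, d, 1)
    return d


if __name__ == "__main__":
    f = lambda x, y: x * x + 3 * y * y + x * y
    I = [[f(x, y) for x in range(64)] for y in range(64)]
    J = [[f(x - 4, y + 8) for x in range(64)] for y in range(64)]  # shift of (4, -8)
    print(coarse_to_fine(I, J, 32, 32))
```

With three levels, this sketch evaluates 25 + 9 + 9 candidate displacements, whereas a single-level search covering the same range of plus or minus 8 pixels would need 289. This is the search-cost disadvantage of single-level correlation that the text mentions.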

2.3  The  Computation  of  General  Motion  for  Independently  Moving
Objects  from  Optical  Flow

The  segmentation  of  an  image  into  independent  objects  is  one  of  the  most  difficult  problems 
in computer vision. Adiv [1,2] has developed an algorithm which performs this
segmentation when the objects are independently moving. His algorithm has two main stages.
In the first stage, the optical flow field (obtained, e.g., via Anandan’s algorithm) is
partitioned into connected segments of flow vectors, where each segment is consistent with
a  rigid  motion  of  a  roughly  planar  surface.  Such  a  segment  is  assumed  to  correspond  to 
part  of  only  one  rigid  object.  This  initial  organization  of  the  data  is  utilized  in  the  second 
stage  without  the  assumption  that  the  surfaces  are  planar.  Segments  are  then  grouped 
under  the  hypothesis  that  they  are  induced  by  a  single  rigidly  moving  object  and/or  by 
the  sensor  motion.  This  is  done  by  computing  the  optimal  motion  parameters  and  related 
error  measure  for  each  segment  by  employing  a  least-squares  approach  that  minimizes  the 
deviation  between  the  measured  flow  fields  and  that  predicted  from  the  estimated  motion 
and  structure.  Based  on  the  fundamental  equations  for  optical  flow: 

$$
u = \frac{T_x - x T_z}{Z} - \Omega_z y + \Omega_y (1 + x^2) - \Omega_x x y ,
$$

$$
v = \frac{T_y - y T_z}{Z} + \Omega_z x - \Omega_x (1 + y^2) + \Omega_y x y ,
$$

the error function to be minimized is:

$$
\sum_{i=1}^{n} W_i \Bigl[ \Bigl( \alpha_i - \frac{T_x - x_i T_z}{Z_i} + \Omega_z y_i - \Omega_y (1 + x_i^2) + \Omega_x x_i y_i \Bigr)^2
+ \Bigl( \beta_i - \frac{T_y - y_i T_z}{Z_i} - \Omega_z x_i + \Omega_x (1 + y_i^2) - \Omega_y x_i y_i \Bigr)^2 \Bigr] ,
$$

where the translation vector is $(T_x, T_y, T_z)$, the rotation vector is
$(\Omega_x, \Omega_y, \Omega_z)$, and for each $i$ between 1 and $n$, $(\alpha_i, \beta_i)$ is
the optical flow vector computed at pixel $(x_i, y_i)$, with $W_i$ its weight. $Z_i$ is the
spatial depth of the corresponding environmental point. The
task  is  to  find  the  translation,  rotation,  and  spatial  depth  which  minimize  this  function. 
This  step  essentially  involves  grouping  segments  of  the  flow  field  which  are  consistent  with 
the  same  motion  parameters.  Therefore  the  output  of  Adiv’s  algorithm  is  a  set  of  object 
masks,  as  well  as  the  motion  parameters  of  each  of  these  independent  objects.  Numerous 
experiments  with  real  data  show  this  algorithm  to  have  quite  good  performance.  In  Figure 
3,  we  show  the  results  of  Adiv’s  algorithm  when  applied  to  the  image  pair  of  Figure  1.  We 
recall  that  there  is  general,  independent  motion  of  the  objects  which  are  imaged.  In  this 
example,  we  see  good  qualitative  agreement  between  the  segmentation  of  the  image  using 
Adiv’s  algorithm,  and  the  actual  objects  in  the  scene. 
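
The least-squares criterion can be illustrated with a short sketch. It assumes the flow model as we read it from the equations above (unit focal length, translation `T`, rotation `omega`, per-point depth `Z`). The function names are hypothetical, and the sketch only evaluates the error for given parameters; it does not perform the minimization or the grouping of segments.

```python
def predicted_flow(x, y, Z, T, omega):
    """Flow (u, v) predicted at image point (x, y) with depth Z for a rigid
    motion with translation T = (Tx, Ty, Tz) and rotation omega = (wx, wy, wz).
    Assumes image coordinates normalized to unit focal length."""
    Tx, Ty, Tz = T
    wx, wy, wz = omega
    u = (Tx - x * Tz) / Z - wz * y + wy * (1 + x * x) - wx * x * y
    v = (Ty - y * Tz) / Z + wz * x - wx * (1 + y * y) + wy * x * y
    return u, v


def flow_error(points, flows, depths, weights, T, omega):
    """Weighted squared deviation between measured flow vectors and the flow
    predicted from the candidate motion parameters and depths."""
    total = 0.0
    for (x, y), (a, b), Z, w in zip(points, flows, depths, weights):
        u, v = predicted_flow(x, y, Z, T, omega)
        total += w * ((a - u) ** 2 + (b - v) ** 2)
    return total


if __name__ == "__main__":
    T, omega = (1.0, 0.0, 0.5), (0.01, -0.02, 0.03)
    points = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.1)]
    depths = [2.0, 4.0, 8.0]
    weights = [1.0, 1.0, 1.0]
    flows = [predicted_flow(x, y, Z, T, omega) for (x, y), Z in zip(points, depths)]
    print(flow_error(points, flows, depths, weights, T, omega))
    # Doubling both the translation and every depth predicts the same flow.
    print(flow_error(points, flows, [2 * Z for Z in depths], weights,
                     (2.0, 0.0, 1.0), omega))
```

The second call shows that, under this model, translation and depth enter only through their ratio, so the error function cannot fix the overall scale of the translation and the depths.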

2.4  Inherent  Ambiguity  in  the  Motion  Analysis  of  Noisy  Flow 
Fields 

Owing  to  the  presence  of  noise  and  other  image  imperfections,  the  optical  flow  in  an  image 
sequence  will  not  be  exact.  The  work  of  Adiv  [3,4]  mathematically  examines  the  robustness 
of  algorithms  which  compute  general  motion  from  optical  flow.  The  analysis  focuses  on 
ambiguities  that  are  inherent  in  the  sense  that  they  are  true  of  all  algorithms,  and  can 
only  be  resolved  if  constraining  assumptions  or  other  sources  of  visual  information  are 
employed. 

Two  sources  of  ambiguity  which  arise  from  noisy  flow  fields  are  examined.  The  first 
ambiguity  is  in  recovering  the  motion  parameters  from  a  noisy  flow  field  generated  by  a 
rigid  motion.  Motion  parameters  of  the  sensor  or  a  rigidly  moving  object  may  be  extremely 
difficult  to  estimate  because  there  may  exist  a  large  set  of  significantly  incorrect  solutions 
which  induce  flow  fields  similar  to  the  correct  one.  Adiv  shows  that  if  the  field  of  view 
corresponding  to  the  region  containing  the  interpreted  flow  field  is  small,  and  the  depth 
variation  and  translation  magnitude  are  small  relative  to  the  distance  of  the  object  from  the 
sensor,  then  the  determination  of  the  3-D  motion  and  structure  can  be  expected  to  be  very 
sensitive  to  noise  and,  in  the  presence  of  a  realistic  level  of  noise,  practically  impossible. 
He  also  experimentally  found  that  there  was  a  relationship  between  the  location  of  the 
focus  of  expansion  (FOE),  the  point  where  the  sensor  velocity  vector  intersects  the  image 
plane,  and  the  degree  of  ambiguity. 

The  second  ambiguity  is  in  the  decomposition  of  the  flow  field  into  sets  of  vectors 
corresponding  to  independently  moving  objects.  Two  independently  moving  objects  may 
induce optical flows which are compatible (modulo the noise) with the same motion
parameters; hence, there is no way to refute the hypothesis that these flows are generated by
one  rigid  object.  Adiv  shows  that  the  standard  rigidity  assumption  [61]  is  not  appropriate 
for  noisy  flow  fields.  He  proposes  that  a  weaker  assumption  is  more  effective,  namely  that  a 
connected  set  of  flow  vectors,  consistent  with  a  rigid  motion  of  a  planar  surface,  is  induced 
by  a  rigid  motion. 

Figure 3: Adiv’s Algorithm.

The grouping of the flow vectors into segments is shown by using various shapes of the
vector tails. Vectors without a tail are ungrouped. In addition, the “correct” boundaries
are shown.

In related work, Snyder [55,56,58] has considered the general effect of uncertainty in
the position of image points on algorithms which attempt to compute environmental
structure from motion. He analyzes the case of uniform translational sensor motion in a rigid
environment. He finds analytical expressions for the uncertainty in depth which follows
from uncertain image point positions, and for the search region for these points in
subsequent frames of a multiple image sequence. The former result can be used to associate
a confidence measure with the depth of each environmental point, and the latter can be
used to constrain the search region for a point of interest in subsequent frames.

2.5  Recovery  of  Depth  from  Approximate  Translational  Motion 

In  this  section,  we  describe  our  early  attempts  at  recovering  environmental  depth  from 
approximate  translational  motion.  As  we  will  note,  although  the  first  few  algorithms  we 
developed  appeared  at  first  sight  to  give  good  results,  extensive  experimentation  on  real 
motion  sequences  convinced  us  that  the  conditions  necessary  for  these  algorithms  to  give 
accurate  depth  values  are  only  rarely  satisfied  in  realistic  motion  scenarios,  so  the  utility 
of  these  earlier  algorithms  seems  to  be  very  restricted.  In  Section  2.5.3,  we  analyze  the 
reasons  for  the  failure  of  these  algorithms  and  present  an  algorithm  which  does  not  suffer 
from  the  same  inadequacies.  It  appears  very  promising  for  the  accurate  determination  of 
both  environmental  depth  and  sensor  motion. 

One  of  our  earliest  attempts  to  recover  the  FOE  in  the  case  of  approximate  translational 
motion  was  the  algorithm  of  Pavlin  [51].  In  this  algorithm,  the  global  search  for  the  FOE 
requires  the  computation  of  the  sum  of  errors  (e.g.,  via  correlation)  associated  with  the 
displacement  of  a  set  of  feature  points  in  two  or  more  frames.  A  sparse  sampling  of  the 
possible  location  of  the  FOE  provides  a  global  error  function  whose  minimum  localizes 
the  direction  of  motion.  The  accuracy  and  robustness  of  this  algorithm  was  found  to  be 
a  function  of  the  number  of  points  that  are  tracked  and  contribute  to  the  error  function, 
which  of  course  must  be  traded  off  against  the  amount  of  computation  that  can  be  tolerated 
for  real-time  motion  analysis. 
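
The global search can be sketched as follows. This is only an illustration under stated assumptions: the error for each candidate FOE is taken to be the squared component of each feature displacement perpendicular to the ray from the candidate, which stands in for the correlation-based error mentioned above. The function name `foe_grid_search` and the synthetic data are hypothetical.

```python
import numpy as np

def foe_grid_search(points, disps, candidates):
    """Return the candidate FOE with the smallest summed error, and that error."""
    points = np.asarray(points, dtype=float)
    disps = np.asarray(disps, dtype=float)
    best_foe, best_err = None, np.inf
    for foe in np.asarray(candidates, dtype=float):
        radial = points - foe                      # ray from candidate FOE to each point
        norm = np.linalg.norm(radial, axis=1)
        norm[norm == 0.0] = 1.0                    # a point at the candidate contributes 0
        # 2-D cross product: displacement component perpendicular to the ray
        cross = radial[:, 0] * disps[:, 1] - radial[:, 1] * disps[:, 0]
        err = float(np.sum((cross / norm) ** 2))
        if err < best_err:
            best_foe, best_err = foe, err
    return best_foe, best_err

# Synthetic pure translation: every displacement points away from the true FOE.
rng = np.random.default_rng(1)
pts = rng.uniform(-1.0, 1.0, (40, 2))
true_foe = np.array([0.1, -0.2])
disps = 0.05 * (pts - true_foe)
grid = np.linspace(-0.5, 0.5, 11)                  # sparse sampling of FOE locations
candidates = [(gx, gy) for gx in grid for gy in grid]
foe, err = foe_grid_search(pts, disps, candidates)
```

The cost grows with the product of the number of tracked points and the number of sampled FOE locations, which is the computation trade-off noted above.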

As  we  will  note  later  in  Section  2.5.3,  the  basic  assumption  of  this  algorithm,  namely 
that  the  sensor  motion  was  purely  translational,  is  rarely  satisfied  in  practical  situations, 
so  that  this  algorithm  is  of  limited  utility. 

2.5.1  Refinement  and  Prediction  of  Image  Dynamics  and  Environmental  Depth 
Maps  over  Multiple  Frames 

The  algorithm  developed  by  Bharwani,  et  al.  [28,29]  was  an  attempt  to  iteratively  refine 
the  depth  map  of  the  environment  over  multiple  frames  so  as  to  obtain  increasingly  more 
precise depth estimates. The algorithm assumes uniform translational sensor motion
between adjacent frames of the multiple image sequence. Although our preliminary results
on synthetic image sequences appeared promising, extensive experimentation with this
algorithm on real image sequences showed that the assumption of uniform translational
motion central to this algorithm is rarely valid. As a result, the practical utility of the
Bharwani  algorithm  appears  to  be  restricted  to  highly  controlled  environments  where  the 
motion  of  the  sensor  can  be  very  precisely  constrained. 

2.5.2  Registration 

As  we  have  noted  in  the  previous  two  sections,  the  assumption  of  uniform  translational 
motion  is  typically  violated  to  such  an  extent  that  the  algorithms  we  developed  were  of 
little  use  in  practical  motion  situations.  Since  the  violation  of  the  assumption  of  uniform 
translational  motion  implies  the  existence  of  rotational  components  in  the  sensor  motion, 
our  next  attempt  to  find  robust,  accurate  algorithms  focused  on  finding  and  removing 
these  rotational  components,  a  process  called  Registration.  We  developed  an  algorithm 
which  attempted  to  do  this,  but  exhaustive  experimentation  showed  that  the  removal  of 
the  rotational  sensor  motion  components  was  a  fragile  and  numerically  unstable  process. 
Indeed,  we  found  that  even  very  small  rotational  components  to  the  motion  (on  the  order 
of  a  few  tenths  of  a  degree)  could  not  be  effectively  removed.  Hence,  this  approach  to  the 
determination  of  sensor  motion  parameters  and  environmental  depth  was  seriously  flawed. 

All  the  problems  we  found  with  these  algorithms  led  us  to  develop  an  algorithm  which 
could  effectively  deal  with  the  existence  of  rotational  as  well  as  translational  components 
to  the  sensor  motion.  That  is,  we  sought  to  develop  an  algorithm  which  could  deal  with 
general  sensor  motion.  This  is  described  in  the  next  section. 

2.5.3  Processing  Approximate  Translational  Motion  for  a  Robotic  Vehicle 

As  we  have  noted  earlier,  our  previous  research  in  motion  analysis  led  us  to  attempt  to 
deal with a real application subsystem for the Carnegie-Mellon University robotic vehicle
[60].  The  goal  was  to  detect  obstacles  in  the  path  of  the  vehicle  at  distances  beyond  the 
limits  of  the  ERIM  laser  range  sensor  (i.e.  at  distances  beyond  40  feet).  Initial  results 
from  Bharwani’s  algorithm  implied  the  possibility  of  extracting  usable  depth  of  obstacles 
at  distances  between  40  and  80  feet.  By  applying  an  FOE  extraction  algorithm  prior  to 
the  depth  extraction  algorithm,  there  was  an  expectation  that  an  effective  subsystem  could 
be  developed.  To  accomplish  this  in  actual  imaging  situations  on  a  moving  vehicle  turned 
out  to  be  far  more  difficult  than  anticipated. 

In dynamic imaging situations where the sensor is undergoing primarily translational
motion with a relatively small rotational component, it might seem likely that
“approximate” translational motion algorithms can be effective in determining depth. Although
translational motion was the dominant form of motion and was approximately constant
over a long sequence of frames, there usually were local variations due to irregularities
in the road surface (bumps, holes, and undulations), as well as minor rotation of the
vehicle as it translated. This was often manifested in changes in the location of the FOE
(i.e. effectively a different translational motion), and in rotational motions
that had to be removed if correct values of depth were to be extracted from the feature
displacements. An attempt to correct for these effects via a relatively simple
preprocessing algorithm (registration of the image sequence), without utilizing full analysis of the
general motion problem, also led to difficulties. The issues and our experimental efforts to
deal with what we initially considered to be the relatively simple problem of approximate
translational motion are discussed in [34]. In that paper, we show quantitatively that even
small rotations can significantly affect the computation of the FOE. This is shown both
theoretically for the case of an environment which can be approximated as a frontal plane
and experimentally for a real image sequence.

These  problems  led  us  to  compare  the  efficacy  of  a  general  motion  algorithm  obtained  by 
combining  the  previously  described  Anandan  and  Adiv  algorithms  with  a  new  translational 
motion  algorithm  obtained  by  using  a  weighted  Hough  transform  technique.  The  latter 
algorithm  finds  all  the  possible  intersections  of  the  displacement  vectors,  and  corresponding 
to  each  intersection  votes  in  a  Hough  array.  The  number  of  votes  corresponding  to  each 
intersection  is  an  increasing  function  of  the  length  and  confidences  of  the  displacement 
vectors  which  intersect.  This  ensures  that  longer  displacement  vectors  and  more  reliable 
displacement vectors are weighted more heavily. The smallest region in the Hough array
containing at least a fraction p of the votes (p = 0.1 in the experiments) is then chosen as
the region for the location of the FOE. The depth of points is then calculated using the
time-adjacency  relationship: 

$$\frac{Z}{\Delta Z} = \frac{D}{\Delta D},$$

where $Z$ is the depth of the 3-D point $P$, $D$ is the distance from the FOE of the
corresponding image point $p$, $\Delta D$ is the distance $p$ moves between the initial and final frames,
and $\Delta Z$ is the inter-frame sensor displacement.
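
The voting scheme and the time-adjacency relationship can be sketched together. The sketch below makes several simplifying assumptions that are not taken from the report: the vote for an intersection is the product of (length × confidence) of the two intersecting vectors, the FOE is taken as the single peak cell of the accumulator rather than the smallest region holding a fraction p of the votes, and the names `hough_foe` and `depth_from_foe` are hypothetical.

```python
import numpy as np

def hough_foe(points, disps, conf, bins=50, extent=(-1.0, 1.0, -1.0, 1.0)):
    """Vote pairwise intersections of displacement lines into a 2-D accumulator."""
    points = np.asarray(points, dtype=float)
    disps = np.asarray(disps, dtype=float)
    weight = np.linalg.norm(disps, axis=1) * np.asarray(conf, dtype=float)
    xmin, xmax, ymin, ymax = extent
    acc = np.zeros((bins, bins))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            denom = disps[i, 0] * disps[j, 1] - disps[i, 1] * disps[j, 0]
            if abs(denom) < 1e-12:                 # (nearly) parallel lines: no vote
                continue
            dp = points[j] - points[i]
            t = (dp[0] * disps[j, 1] - dp[1] * disps[j, 0]) / denom
            ix, iy = points[i] + t * disps[i]      # intersection of the two lines
            if xmin <= ix < xmax and ymin <= iy < ymax:
                cx = int((ix - xmin) / (xmax - xmin) * bins)
                cy = int((iy - ymin) / (ymax - ymin) * bins)
                acc[cy, cx] += weight[i] * weight[j]
    cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
    foe = np.array([xmin + (cx + 0.5) * (xmax - xmin) / bins,
                    ymin + (cy + 0.5) * (ymax - ymin) / bins])
    return foe, acc

def depth_from_foe(points, disps, foe, dz):
    """Apply Z / dZ = D / dD to every point."""
    D = np.linalg.norm(np.asarray(points, dtype=float) - foe, axis=1)
    dD = np.linalg.norm(np.asarray(disps, dtype=float), axis=1)
    return dz * D / dD

# Synthetic pure translation of 4 units toward the scene.
rng = np.random.default_rng(2)
true_foe = np.array([0.1, 0.06])
r = rng.uniform(-0.8, 0.8, (30, 2))                # first-frame offsets from the FOE
Z0 = rng.uniform(20.0, 60.0, 30)                   # first-frame depths
dz = 4.0
p0 = true_foe + r
disps = r * (dz / (Z0 - dz))[:, None]              # image points expand away from the FOE
foe, acc = hough_foe(p0, disps, np.ones(30))
depth = depth_from_foe(p0, disps, foe, dz)
```

In this synthetic setup, with D measured in the first frame, the relation returns the depth at the second frame (Z0 - dz); the two depths agree to first order when the sensor displacement is small compared with the depth.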

We  found  [34]  that  the  depths  of  points  in  a  real  image  sequence  were  obtained  with 
an  error  of  about  9%  for  the  general  motion  algorithm  and  of  about  20%  for  the  weighted 
Hough  transform  algorithm.  In  Figure  4,  we  show  six  frames  from  a  motion  sequence  taken 
with the Carnegie-Mellon (CMU) robotic vehicle. In Table 1, we show the ground truth
and  experimental  depth  values  for  a  number  of  objects  in  this  image  sequence.  In  Table 
2,  we  show  the  results  for  the  motion  parameters  obtained  from  the  same  algorithm.  In 
Table  3,  we  show  the  average  error  in  depth  for  points  on  the  obstacles  (traffic  cones)  in 
this  image  sequence.  We  conclude  that  while  the  FOE  might  be  approximately  extracted, 
most  real  situations  require  general  motion  analysis  to  reliably  determine  the  depth  of 
points,  even  when  sensor  motion  is  primarily  translational  with  only  small  amounts  of 
rotation.  One  obvious  hardware  solution  (at  significantly  increased  cost)  is  the  use  of  a 
gyro-stabilized  platform  so  that  sensor  motion  typically  will  be  much  closer  to  the  case  of 
pure  translational  motion. 

We  have  also  developed  algorithms  which  represent  alternatives  to  this  approach.  These 
are  described  in  the  next  sections. 

Figure 4: The Sequence of Image Frames Taken With the CMU Robotic Vehicle.

Panels: Frame 1 with displacement vectors for 1-3; Frame 3 with displacement vectors for 3-5;
Frame 5 with displacement vectors for 5-7; Frame 7 with displacement vectors for 7-9;
Frame 9 with displacement vectors for 9-11; Frame 11.

Table 1: Depth Values of Some Points Over a Sequence of Frames
Using the General Motion Algorithm.

The two tables used 100 and 200 points respectively. Depths are in feet. * and © indicate
respectively that the point was not among the top 100 or 200 Moravec points. ** indicates
that the point is absent in the image-pair.

100 pts      1-3     3-5     5-7     7-9     9-11
U          -0.09   -0.09   -0.09   -0.09   -0.09
V          -0.25   -0.25   -0.25   -0.25   -0.25
W          -0.96   -0.96   -0.96   -0.96   -0.96
A          -0.19    0.17   -0.10   -0.04   -0.03
B           0.39    0.56    0.53    0.49    0.43
C          -0.30    0.01    0.07    0.06    0.28

200 pts      1-3     3-5     5-7     7-9     9-11
U          -0.09   -0.16   -0.09   -0.09   -0.09
V          -0.25   -0.21   -0.25   -0.25   -0.25
W          -0.96   -0.96   -0.96   -0.96   -0.96
A          -0.19    0.11   -0.10   -0.03    0.03
B           0.41    0.17    0.53    0.49    0.43
C          -0.22   -0.52    0.10    0.07    0.31

Table 2: Motion Parameters Obtained Using the General Motion Algorithm.
The frame pairs are at 4 ft. intervals. The results have been tabulated for 100 and for
200 Moravec points. (U,V,W) is the unit translation vector, and (A,B,C) is the rotational
vector in degrees.

Frame-Pair    Average Error
1-3           12.4 %
3-5            6.9 %
5-7            8.2 %
7-9            9.2 %
9-11           5.4 %

Total Average Error = 8.9 %.

Table 3: Average Errors in Depth For Points on the Obstacles.

The obstacles are the traffic cones in Figure 4. The results are for the general motion
algorithm of Adiv.

2.6  Stereoscopic Motion Analysis and the Detection of Discontinuities

By carrying out motion analysis with imagery from a pair of sensors (stereoscopic motion),
the additional constraints can significantly reduce the complexity of the analysis on a
theoretical level. Balasubramanyam and Snyder [23,24,25] have developed an algorithm to
extract the parameters of motion in depth: the single component $T_Z$ of translation in depth
(i.e. parallel to the line of sight) and the two components $\Omega_X$ and $\Omega_Y$ of rotation in depth
(i.e. rotations that are not around the line of sight). This is achieved by building upon
the work of Adiv for segmenting the flow field into rigid independently moving objects [1],
and the formulation of Waxman and Duncan [62]. The latter authors show that the ratio
of the relative optical flow between a stereo pair of images to the disparity between them
is a linear function of the image coordinates:


$$\frac{\Delta\alpha}{\delta} = \Omega_Y x_i - \Omega_X y_i - \frac{T_Z}{Z}, \qquad \frac{\Delta\beta}{\delta} = 0,$$

where $\Delta\alpha$ and $\Delta\beta$ are the components of the relative optical flow between the two images,
$\delta$ is the disparity between the two images, $(x_i, y_i)$ is the coordinate of a point $p$ in the left
frame, $\Omega_X$, $\Omega_Y$, and $T_Z$ are the three motion-in-depth parameters, and $Z$ is the spatial
depth of the corresponding environmental point $P$.

The  algorithm  proceeds  in  four  steps: 

1.  Extract  the  relative  optical  flow  field  between  the  left  and  right  images  using  the 
difference  between  the  two  optical  flow  fields,  along  with  the  disparity  field. 


2.  Use  Adiv’s  algorithm  to  segment  the  monocular  optic  flow  corresponding  to  the  left 
sensor.  This  segmentation  is  therefore  performed  using  only  motion  information  in 
the  2-D  image  plane  in  order  to  obtain  a  grouping  of  the  flow  vectors,  where  each 
segment  corresponds  to  the  motion  of  a  roughly  planar  surface. 

3.  Merge  the  segments  on  the  2-D  image  plane  (obtained  from  the  segmentation  step) 
based  on  a  least-square  minimization  to  compute  the  motion-in-depth  parameters 
for  each  of  the  merged  regions.  The  output  at  this  stage  is  a  grouping  of  the  image 
into  regions  that  correspond  to  the  same  set  (within  some  normalized  value  of  the 
deviation) of motion-in-depth parameters.


4.  Minimize the following error functional over each set of relative flow vectors
corresponding to a single segment or possibly a merged set of them:

$$E(\Omega_X, \Omega_Y, T_Z) = \sum_{i=1}^{n} \left[ \frac{\Delta\alpha_i}{\delta_i} - \Omega_Y x_i + \Omega_X y_i + \frac{T_Z}{Z_i} \right]^2,$$

where $(x_i, y_i)$ denotes an image point in the set in question, and $n$ is the number of
elements in the set.
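
The minimization in step 4 can be sketched as an ordinary linear least-squares problem. The sketch assumes the linear relation takes the form Δα_i/δ_i = Ω_Y x_i − Ω_X y_i − T_Z/Z_i and that the depth Z_i of each point is already available (here from the disparity, Z = b/δ, with an assumed baseline b and unit focal length). The function name `motion_in_depth` and the synthetic data are hypothetical, not the authors' implementation.

```python
import numpy as np

def motion_in_depth(x, y, d_alpha, delta, Z):
    """Least-squares estimate of (Omega_X, Omega_Y, T_Z) for one segment."""
    x, y, d_alpha, delta, Z = (np.asarray(a, dtype=float)
                               for a in (x, y, d_alpha, delta, Z))
    # One row per point: d_alpha/delta = -y*Omega_X + x*Omega_Y - (1/Z)*T_Z
    A = np.column_stack([-y, x, -1.0 / Z])
    rhs = d_alpha / delta
    params, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    residual = float(np.sum((A @ params - rhs) ** 2))
    return params, residual

# Synthetic segment with known motion-in-depth parameters.
rng = np.random.default_rng(3)
x, y = rng.uniform(-0.5, 0.5, (2, 50))
Z = rng.uniform(10.0, 50.0, 50)
baseline = 0.5                                     # assumed, in focal units
delta = baseline / Z                               # disparity for unit focal length
true = np.array([0.02, -0.01, 0.3])                # Omega_X, Omega_Y, T_Z
d_alpha = delta * (true[1] * x - true[0] * y - true[2] / Z)
est, res = motion_in_depth(x, y, d_alpha, delta, Z)
```

The residual returned for each segment could serve as the deviation measure used when deciding whether two segments share the same motion-in-depth parameters.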

The  algorithm  was  run  on  synthetic  data  with  general  motion  of  both  the  sensor  and 
independently  moving  objects.  It  shows  good  performance  with  ideal  images  (i.e.,  no 
noise), but shows some degradation of performance with increasing noise. A representative
example of the results obtained is given in Figure 5 and Table 4. Work is currently
underway  to  test  the  effectiveness  of  this  algorithm  on  real  scenes. 

One  of  the  most  important  problems  in  stereo  and  motion  processing  is  the  recovery  of 
depth  and  motion  boundaries.  A  number  of  algorithms  for  computing  optic  flow  make  a 
global  smoothness  assumption  that  tends  to  unnaturally  smooth  across  depth  and  motion 
discontinuities.  This  makes  later  detection  of  these  boundaries  very  difficult.  On  the 
other  hand,  knowledge  of  these  discontinuities  is  very  important  for  the  flow  and  disparity 
computations  to  be  correct,  especially  at  occlusion  boundaries. 

One  approach  to  this  problem  is  to  integrate  motion  and  stereo  data.  Balasubramanyam 
and Weiss [26] use information in both the stereo and motion sequences at two time
instances to define a confidence measure in the presence of motion and depth discontinuities.
This measure can be applied early, prior to the full computation of flow and disparity fields.
The general idea is to use coarse disparity and flow estimates from hierarchical correlation
processes [10] to locate and label depth and motion discontinuities; smoothing is then
inhibited across these boundaries. Discontinuities that are continuous (i.e. unbroken) in the
other  dimension  are  favored.  The  results  of  running  this  algorithm  on  both  synthetic  and 
real  stereo-motion  imagery  are  presented  in  [26].  We  give  an  example  in  Figure  6. 


2.7  Smoothness Constraints for Optical Flow and Surface Reconstruction

The  computation  of  optical  flow  normally  requires  a  constraint  on  the  variation  of  the 
flow  fields  from  constancy.  Snyder  [57]  has  given  an  axiomatic  derivation  of  the  possible 
smoothness  constraints  under  a  small  number  of  physically  reasonable  assumptions.  He 
shows  that  there  are  only  four  possible  smoothness  constraints  which  are  quadratic  in  first 
derivatives  of  the  optical  flow,  and  either  first  or  second  derivatives  of  the  image  intensity 
function  that  satisfy  these  assumptions.  He  also  gives  a  novel  geometric  interpretation  of 
these  smoothness  constraints,  and  shows  that  only  two  of  the  four  are  physically  sensible. 


2.8  Analysis  of  Constant  General  Motion 

Another  way  to  introduce  additional  constraints  to  the  problem  of  general  motion  analysis 
in  an  effort  to  achieve  practical,  robust  algorithms  is  via  Shariat’s  formulation:  constant  but 
arbitrary  general  motion  of  a  rigid  object  [54].  This  leads  to  a  set  of  difference  equations 
across  a  sequence  of  images,  relating  the  positions  of  a  feature  in  the  image  plane  to 
the  motion  parameters  of  the  projected  point.  The  solution  obtained  is  a  set  of  5th 

Figure 5a: Simulated ideal, dense optic flow field for the left camera.
Figure 5b: Simulated ideal, dense optic flow field for the right camera.
Figure 5c: Simulated ideal dense field of disparity vectors. Baseline is 0.5 focal units.
Figure 5d: Result of segmentation performed using Adiv’s algorithm [1].
Figure 5e: Result of merger in the optimization step of the algorithm. Note that the
independently moving sphere is picked out.

Figure 5: Stereoscopic Motion.

The algorithm of Balasubramanyam and Snyder applied to a noisy optical flow field. The
camera motion is completely general; the sphere is moving independently with no
motion-in-depth components, while the ellipsoid is stationary.

Sphere: size 2,2,2; position 9,9,30
                                    Input                                       Computed
Object Translation (focal units)    T_X = 0.50, T_Y = -0.5, T_Z = 0.00          T_Z = 0.11
Object Rotation (radians)           Ω_X = 0.00, Ω_Y = 0.00, Ω_Z = -0.19         Ω_X = 0.04, Ω_Y = 0.02

Ellipsoid: size 2,5,2; position -3,-1,20; stationary

Plane: Z = X + 0.5Y + 50; stationary

Camera
                                    Input                                       Computed
Camera Translation (focal units)    T_X = 0.50, T_Y = 0.05, T_Z = 1.0           T_Z = [illegible]
Camera Rotation (radians)           [illegible]                                 [illegible]

Table 4: General Camera Motion with Independent Object Motion.


Figure  6:  The  Balasubramanyam  and  Weiss  Algorithm. 

The  four  video  images  at  the  top  of  the  page  are  the  stereo  motion  sequence.  The  images 
from  the  left  (right)  camera  are  on  the  left  (right).  The  earlier  stereo  pair  is  at  the  bottom, 
while  the  later  stereo  pair  is  at  the  top.  The  binary  image  at  the  bottom  of  the  page  shows 
the  flow  disparity  estimate. 
order non-linear polynomial equations in the unknown motion parameters. The solution
requires  a  Gauss-Newton  non-linear  least-squares  method  with  carefully  designed  initial 
guess  schemes.  Pavlin  [52]  has  derived  a  closed-form  solution  for  the  rigid  object  trajectory 
by  integrating  the  differential  equations  describing  the  motion  of  a  point  on  the  tracked 
object.  The  integrated  equations  are  non-linear  only  in  angular  velocity,  and  are  linear  in 
all  other  motion  parameters.  These  equations  allow  the  use  of  a  simple  least-square  error 
minimization  criterion  in  an  iterative  search  for  the  motion  parameters. 

2.9  Token-Based Approaches to Motion and Perceptual Organization

The problems cited previously with respect to the extraction of motion and depth
information using traditional optical flow techniques have led us toward the exploration of
methods for combining the local flow/displacement fields with larger token-like structures.
It is our position that the inherently local measurement of visual motion provided by
optical flow is insufficient to meet the varied requirements of dynamic image
understanding. The approach we developed involves computing the correspondence between tokens
of arbitrary spatial scale produced by perceptual organization processes. Such tokens
often map directly to environmental structure, and descriptions of their movement often
correlate more closely with the motion of physical objects than does the local motion
information contained in the displacement field. A token match represents more than just
a spatial displacement; also explicit in this representation are the time-varying values of
those parameters which define the token, or which can be extracted from the structure of
the token.

Williams and Hanson [65,66] describe work in progress toward this goal.
The  premise  of  this  work  is  that  the  structure  obtained  from  perceptual  organization 
processes  can  be  combined  with  the  local  motion  information  contained  in  the  flow  field 
to  provide  a  more  robust  estimate  of  motion  and  depth  parameters.  The  approach  can  be 
viewed  as  augmenting  the  rather  limited  use  of  spatial  structure  in  traditional  approaches 
with  the  richer  descriptive  vocabulary  of  spatial  structure  provided  by  the  perceptual 
organizational  processes  over  both  space  and  time.  In  this  sense,  the  spatially  organized 
structures  (such  as  lines,  regions,  curves,  vertices,  intersections,  rectangular  groups,  etc.), 
which  are  actively  constructed  from  the  image  can  be  considered  to  be  interest  operators 
of  large  spatial  extent. 

In  their  first  paper  [65],  a  method  for  computing  the  temporal  correspondence  between 
straight  line  segments  is  presented.  We  consider  the  two  frame  case  here,  but  the  method  is 
extensible, and has been extended, to multiple frames. A straight line perceptual
organization process developed by Boldt and Weiss [27,64] is applied to both frames independently
to provide straight lines in each frame. A displacement field is also computed from the two
frames using the algorithm developed by Anandan [9,10]. After filtering the straight lines
on length and contrast to reduce the line set in both images, the displacement field is used
to construct a search area in Frame 2 for each line in Frame 1. Since a one-to-one
correspondence between lines is unlikely, a minimal mapping approach [61] is used to compute
the  correspondence  between  the  Frame  1  and  Frame  2  line  sets;  such  a  mapping  is  called 
a  minimal  bipartite  cover.  The  similarity  measure  used  to  compute  the  cover  involves 
the  similarity  and  spatial  separation  of  the  candidate  token  matches.  By  computing  the 
connected  components  of  the  bipartite  graph,  the  global  matching  problem  is  conveniently 
divided into smaller, individually tractable pieces which reflect the scope of potential
interactions. A simple blind search of the subgraphs is used to extract the bipartite cover
minimizing  the  positional  and  similarity  discrepancy  metric. 

The matching results obtained are quite good. The system has been run repeatedly
on successive frames of several multi-frame sequences. In the multi-frame case, a directed
acyclic graph is constructed which represents the splitting and merging patterns of line
segments over time. Work is in progress to analyze the trajectories of the tokens over time.
In  Figure  7,  we  show  the  first  frame  of  an  image  sequence  of  a  soccer  ball,  the  computed 
displacement  field,  the  line  tokens,  and  the  output  of  the  matching  process  for  selected 
lines. 

In  their  second  paper  [66],  a  method  for  computing  depth  from  the  line  correspondences 
is  described  using  the  temporal  change  in  the  length  of  virtual  lines  constructed  from  the 
intersections  of  the  Boldt  lines  [27].  They  use  virtual  lines  because  the  length  of  the  original 
lines  is  not  reliable,  although  their  orientation  and  lateral  displacement  are  quite  precise. 
This  “looming”  method  is  also  generalized  to  areas.  The  method  is  generally  applicable  to 
any  structure  whose  total  extent  in  depth  is  small  compared  to  the  depth  of  its  centroid 
(that  is,  for  those  cases  in  which  perspective  projection  can  be  approximated  by  scaled 
orthographic  projection  [59])  and  which  does  not  exhibit  any  independent  motion.  The 
technique  does  not  depend  on  the  complete  determination  of  the  egomotion  parameters  of 
the  sensor,  but  it  does  require  the  computation  of  the  component  of  the  sensor’s  translation 
in  the  direction  of  motion.  An  analysis  of  the  sensitivity  of  the  algorithm  to  errors  in  the 
measured  variables  is  planned  for  the  near  future;  experimental  results  on  real  image 
sequences  suggest  that  the  algorithm  may  be  quite  robust.  In  Figure  8,  we  show  the 
first  frame  of  an  indoor  motion  sequence  taken  by  our  mobile  robot.  In  Figure  9,  we 
show  the  line  segments  used  to  define  virtual  lines  and  virtual  regions.  In  Tables  5  and 
6,  we  show  the  experimental  results  for  depth  using  the  virtual  lines  and  virtual  regions, 
respectively.  The  error  in  depth  seems  to  be  around  5%;  this  is  a  promising  result,  but 
further  experimentation  is  necessary. 
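
To make the looming idea concrete, the following sketch derives depth from the growth of an image length under the scaled orthographic approximation. It is a derivation under stated assumptions rather than the formulation of [66]: a structure of fixed extent L at depth Z projects to an image length l = fL/Z, the sensor advances by a known amount dz along its direction of motion toward the structure, and the function name `looming_depth` is hypothetical.

```python
def looming_depth(l1, l2, dz):
    """Depth at the first frame from image lengths l1 (before) and l2 (after).

    From l1 = f*L/Z and l2 = f*L/(Z - dz) it follows that Z = dz * l2 / (l2 - l1).
    """
    if l2 <= l1:
        raise ValueError("image length must grow as the sensor approaches")
    return dz * l2 / (l2 - l1)

# Worked example: a 2 ft wide structure at 20 ft, sensor advances 4 ft (f = 1).
l_before = 2.0 / 20.0        # 0.1
l_after = 2.0 / 16.0         # 0.125
z_est = looming_depth(l_before, l_after, 4.0)
```

Because the estimate divides by the difference l2 − l1, small measurement errors in the lengths matter most when the length change is small, which is one reason longer virtual lines and regions are attractive.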

2.10  3-D Interpretation of Rotational Motion from Image Trajectories

The  research  of  Sawhney  and  Oliensis  [53]  addresses  the  problem  of  finding  the  motion 
parameters  of  independently  moving  objects  in  their  natural  coordinate  system.  They 
analyze  an  extended  time  sequence  of  images  of  an  object  rotating  uniformly  around  an  axis 
of  arbitrary  location  and  orientation,  and  demonstrate  how  the  abstraction  of  continuous 
descriptions  of  multi-frame  data  can  lead  to  the  recovery  of  scene  motion  and  structure. 

Figure 8: Mobile Robot Image Sequence.

The first frame of a motion sequence taken by a mobile robot moving down the hallway.

Virtual Line    Depth (ft.)    Ground Truth (ft.)    % Error    t
Cone 1          19.1           20.0                   4.5       1
Cone 2          23.6           25.0                   5.6       3
Cone 3          28.3           35.0                  19.1       1
Cone 4          42.1           40.0                   5.3       7
Can 1           29.0           30.0                   3.3       7
Wall 1          27.7           27.1                   2.2       2
Wall 2          48.8           48.7                   0.2       7
Doorway         88.8           87.1                   2.0       7

Table 5: Comparison of the Computed and Ground Truth
Depth for the Virtual Lines.
Virtual Region   Depth (ft.)   Ground Truth (ft.)   % Error   t
Cone 1           20.1          20.0                 0.5       1
Cone 2           25.8          25.0                 3.2       3
Cone 3           35.5          35.0                 1.4       1
Cone 4           40.0          40.0                 0.0       7

Table 6: Comparison of the Computed and Ground Truth Depth for the Virtual Regions.

Image traces of 3-D feature points are generated from image point correspondences over a
sequence  of  frames.  These  traces  are  described  by  continuous  curves  that  are  obtained  by 
fitting  conic  arcs  to  the  set  of  points.  The  goal  is  motion-based  grouping  of  image  traces 
to  provide  constraints  (unavailable  in  only  a  few  frames)  sufficient  to  extract  the  motion 
parameters  of  independently  moving  objects  in  their  natural  coordinate  system. 
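
The conic-fitting step can be illustrated with a small least-squares sketch. Under perspective projection a circle in space images as a conic section, which is why a conic is a natural description for the trace of a uniformly rotating point. The code below is our own minimal illustration, not the method of [53]: it fixes the constant term of the conic to 1 (an assumption that excludes conics passing through the image origin), and all function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("points do not determine a unique conic")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_conic(points):
    """Least-squares coefficients (A, B, C, D, E) of A x^2 + B xy + C y^2 + D x + E y = 1."""
    if len(points) < 5:
        raise ValueError("at least five points are needed")
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    # Normal equations (M^T M) p = M^T 1 for the design matrix M built from `rows`.
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    rhs = [sum(r[i] for r in rows) for i in range(5)]
    return solve_linear(normal, rhs)


def conic_type(coeffs, eps=1e-9):
    """Classify the fitted conic by the sign of the discriminant B^2 - 4AC."""
    a, b, c = coeffs[:3]
    disc = b * b - 4.0 * a * c
    if disc < -eps:
        return "ellipse"
    if disc > eps:
        return "hyperbola"
    return "parabola"
```

Given the tracked image positions of one feature point as a list of (x, y) pairs, `fit_conic` returns the five coefficients of the continuous curve, and traces whose fitted curves are mutually consistent are candidates for grouping.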

2.11  A Motion Data Set from the Autonomous Land Vehicle (ALV)

A  major  difficulty  with  the  analysis  of  motion  algorithms  has  been  the  lack  of  motion  data 
with  ground  truth  of  known  precision.  In  particular,  these  data  have  not  been  collected 
for  robot  vehicles  operating  under  realistic  conditions  in  outdoor  environments.  Thus,  the 
proper  scientific  evaluation  of  motion  algorithms  intended  for  practical  application  has 
been  impossible. 

In  response  to  this  general  problem,  our  group  decided  to  collect  a  reasonably  large  data 
set  from  the  ALV  [35,36].  Motion  sequences  of  about  30  frames  each  were  collected  at  five 
different  outdoor  sites  with  different  road  surfaces,  including  on-road,  dirt-road,  and  off¬ 
road  scenarios.  Data  from  the  video  camera,  laser  range  finder,  and  land  navigation  system 
(LNS)  were  recorded  simultaneously  under  stop-and-shoot  and  move-and-shoot  scenarios. 
Ground  truth  data  for  the  3-D  environment  were  obtained  using  traditional  surveying 
methods,  while  the  LNS  provided  ground  truth  data  for  the  motion  parameters.  This 
motion  data  set  is  available  to  the  general  community  and  can  be  obtained  by  contacting 
Ms.  Valerie  Cohen  at  the  University  of  Massachusetts  (UMass)  at  Amherst  (E-mail  address 
is  VCohen@CS.UMass.EDU). 
3  Mobile Robot Navigation

Vision-based  mobile  robot  navigation  is  a  relatively  recent  addition  to  the  VISIONS  re¬ 
search  group  at  UMass.  We  have  acquired  a  mobile  robot  (called  HARV)  that  will  enable 
us  to  develop  a  testbed  for  many  of  the  vision  algorithms  that  we  have  developed  and 
continue  to  develop.  The  robot  is  to  be  operated  both  indoors  and  out,  providing  a  wide 
variety  of  scenes  for  analysis.  The  integration  of  robot  planning,  perception,  and  motor 
control  systems  for  effective  navigation  is  the  focus  of  continuing  work,  beginning  with  the 
work  of  Arkin  [13]  and  continuing  with  the  more  recent  work  of  Fennema  [37,38,39]. 

3.1  AuRA — the  Autonomous  Robot  Architecture 

Arkin  developed  an  integrated  system,  the  UMass  Autonomous  Robot  Architecture  (AuRA) 
[11,12,13,14,15,16,17,18,19,20,21,22],  to  support  this  research  effort.  It  incorporated  both 
global  and  reflexive  schema-based  path  planning  strategies  and  utilized  a  priori  knowl¬ 
edge  stored  in  long-term  memory,  when  available,  to  assist  the  vehicle’s  attainment  of  its 
navigational  goals. 

AuRA  has  five  major  components:  the  planning,  cartographic,  perception,  motor,  and 
homeostatic  subsystems.  A  block  diagram  of  AuRA  is  presented  in  Figure  10. 

The  purpose  of  the  hierarchical  planning  subsystem  is  to  handle  the  task  of  path  plan¬ 
ning  in  both  indoor  and  outdoor  environments.  The  cartographic  subsystem  maintains 
the information in long- and short-term memory (which store a priori and acquired world
knowledge,  respectively),  and  supplies  it  on  demand  to  the  planning  and  perception  mod¬ 
ules.  The  perception  subsystem  processes  all  the  sensory  information  from  the  environ¬ 
ment,  interprets  it,  and  delivers  it  to  the  cartographic  subsystem.  The  motor  subsystem 
controls  the  motion  of  the  vehicle.  Finally,  the  homeostatic  subsystem  is  concerned  with 
maintaining  a  safe  internal  environment  for  the  robot. 

The  chief  navigational  issues  addressed  in  the  work  of  Arkin,  and  also  that  of  Fennema, 
include  path  following,  landmark  recognition  for  vehicle  localization,  and  obstacle  avoid¬ 
ance.  A  new  fast  line  finding  algorithm  [46]  was  used  for  hall  and  sidewalk  navigation 
and  for  localization  purposes.  Our  depth-from-motion  algorithms  are  used  for  obstacle 
avoidance,  and  can  also  provide  information  for  landmark  identification  when  coupled  with 
top-down  knowledge  of  expected  landmark  locations.  A  new  fast  region  segmentation  algo¬ 
rithm  [32]  has  found  potential  application  in  both  path  following  and  vehicle  localization. 
A  description  of  all  these  algorithms  and  their  use  within  AuRA  can  be  found  in  [13]. 

Arkin  is  now  at  the  Georgia  Institute  of  Technology,  continuing  the  development  of 
AuRA.  Fennema  has  built  on  our  experience  with  Arkin’s  systems  to  develop  new  systems 
for  model-directed  navigation,  described  in  the  next  section. 
3.2  Planning and Control via Milestones for Model-Directed Navigation

Our  mobile  robot,  HARV,  begins  with  an  accurate,  but  incomplete,  model  of  the  world 
implemented  in  GeoMeter  (Section  3.3).  Each  task  given  to  HARV  is  translated  by  a 
command  interpreter  and  problem  solver  which  ultimately  produces  a  set  of  navigational 
goals.  The  execution  of  these  goals  is  accomplished  by  a  tight  interweaving  of  planning,  per¬ 
ception,  and  action,  orchestrated  by  a  dynamic  planning  and  execution  scheme  [37,38,39]. 
This subsystem works with plans, each represented as a sequence (M_0 A_1 M_1 ... A_n M_n)
of milestones M_k and proposed actions A_k. Milestones are constructed from perceivable
events,  and  are  used  to  verify  the  successful  completion  of  a  particular  phase  of  the  plan. 
As  used  here,  milestones  are  composed  of  3-D  landmarks  (perceivable  physical  events) 
and their expected location with respect to the robot at the completion of the appropriate
phase of the plan. They allow the progress of the plan to be monitored, and they trigger
replanning before the next action is taken whenever perception and the milestone disagree.
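
The milestone mechanism can be sketched with a toy one-dimensional example. Everything in it is an assumption made for illustration: the robot drives along a line toward a goal, a single landmark at a known position stands in for a 3-D landmark, and the names `Milestone`, `Action`, `make_plan`, `execute`, and `SimRobot` are hypothetical rather than taken from [37,38,39].

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    expected_offset: float  # expected landmark position relative to the robot
    tolerance: float        # largest disagreement accepted without replanning


@dataclass
class Action:
    distance: float         # commanded forward travel


def make_plan(start, goal, step, landmark_at, tolerance=0.2):
    """Build the sequence (M_0 A_1 M_1 ... A_n M_n) for a straight 1-D path."""
    plan = [Milestone(landmark_at - start, tolerance)]
    position = start
    while goal - position > 1e-9:
        distance = min(step, goal - position)
        position += distance
        plan.append(Action(distance))
        plan.append(Milestone(landmark_at - position, tolerance))
    return plan


def execute(goal, step, landmark_at, move, sense, max_replans=20):
    """Run the plan, checking each milestone before the next action is taken."""
    plan = make_plan(landmark_at - sense(), goal, step, landmark_at)
    replans = 0
    i = 1  # plan[0] is M_0, which describes the starting state
    while i < len(plan):
        action, milestone = plan[i], plan[i + 1]
        move(action.distance)
        observed = sense()
        if abs(observed - milestone.expected_offset) > milestone.tolerance:
            # Perception and milestone disagree: relocalize from the landmark
            # and replan from the position actually reached.
            if replans == max_replans:
                raise RuntimeError("too many replans")
            replans += 1
            plan = make_plan(landmark_at - observed, goal, step, landmark_at)
            i = 1
        else:
            i += 2
    return replans


class SimRobot:
    """Toy robot whose wheels slip, so it travels less than commanded."""

    def __init__(self, slip):
        self.position = 0.0
        self.slip = slip

    def move(self, distance):
        self.position += distance * self.slip
```

With a simulated robot that covers only 80% of each commanded distance, every failed milestone check triggers a replan from the sensed position, and the robot still ends within the milestone tolerance of the goal.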

Planning,  perception,  and  execution  are  directed  by  the  plan-and-monitor  executive 
in  such  a  way  as  to  dynamically  modify  and  refine  the  plan  to  fit  the  actual  results  of  each 
action  and  the  details  of  the  perceived  environment.  The  principal  activities  involved  in 
this  process  are  planning,  milestone  recognition,  determination  of  location,  and  execution 
of  primitive  actions.  Interweaving  perception,  planning,  and  action  in  this  way  makes 
specific  what  task  is  expected  of  perception,  and  provides  a  way  of  focusing  the  available 
knowledge  to  that  end.  The  result  is  a  distribution  of  perception  and  perceptual  reasoning 
into  all  aspects  of  navigation. 

The  actual  motion  in  response  to  the  plan  is  produced  by  the  plan-and-execute  mod¬ 
ule.  This  motion  is  controlled  using  perceptual  servoing.  Perceptual  servoing  determines 
the  robot’s  motion  by  enforcing  control  at  several  levels:  action-level  servoing  ensures 
accurate  execution  of  each  primitive  action;  plan-level  servoing  uses  vision  to  ensure  that 
the  accumulation  of  primitive  actions  conforms  to  a  plan;  and  goal-level  servoing  ensures 
that  overall  action  is  directed  to  the  goal.  Each  level  uses  model-directed  vision  and  com¬ 
pares  what  is  sensed  to  what  is  expected,  and  issues  corrective  actions  to  minimize  any 
difference.  The  detailed  explanation  of  each  of  these  can  be  found  in  the  work  of  Fennema, 
et  al.  [37,38,39]. 

3.3  GeoMeter 

Models  of  the  vehicle’s  environment  are  built  using  GeoMeter,  a  three-dimensional  solid 
modelling  package  developed  jointly  by  UMass  and  the  General  Electric  Research  and 
Development  Center  [33].  GeoMeter  is  implemented  in  CommonLisp  and  is  oriented  to¬ 
wards  image  understanding  research  (although  it  has  many  other  potential  applications). 
It  currently  runs  on  several  types  of  workstations,  including  Symbolics  LISP  machines,  TI 
Explorers,  VAX  workstations,  and  SUN  workstations.  Work  is  under  way  to  allow  it  to 
run  on  the  Sequent  Balance  2000. 
GeoMeter adopts the language of simplicial complexes in algebraic topology for describing
surfaces. It provides generality and an explicit representation of vertices, edges,
and faces. Each of these serves as a type of geometric primitive, and can be parametrized
as a smooth function from a point, unit interval, and triangle to R^3, respectively. Surfaces
are  constructed  as  the  union  of  these  primitives,  and  are  denoted  by  a  sum  of  simplices. 
This  representation  produces  a  triangulation  of  the  surface,  where  the  triangles  are  not 
necessarily  planar. 
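
The idea of primitives as parametrized maps can be made concrete with a short sketch. GeoMeter itself is written in CommonLisp; the Python below is only an illustration of the representation described here, and the class and function names are our own, not GeoMeter's.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Vertex:
    """0-simplex: a map from a single point to R^3."""
    position: Point3

    def at(self) -> Point3:
        return self.position


@dataclass(frozen=True)
class Edge:
    """1-simplex: a smooth map from the unit interval [0, 1] to R^3."""
    curve: Callable[[float], Point3]

    def at(self, t: float) -> Point3:
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        return self.curve(t)


@dataclass(frozen=True)
class Face:
    """2-simplex: a smooth map from the triangle u, v >= 0, u + v <= 1 to R^3."""
    patch: Callable[[float, float], Point3]

    def at(self, u: float, v: float) -> Point3:
        if u < 0.0 or v < 0.0 or u + v > 1.0:
            raise ValueError("(u, v) must lie in the standard triangle")
        return self.patch(u, v)


def blend(p: Point3, q: Point3, r: Point3, u: float, v: float) -> Point3:
    """Barycentric combination (1 - u - v) p + u q + v r."""
    w = 1.0 - u - v
    return tuple(w * a + u * b + v * c for a, b, c in zip(p, q, r))


def straight_edge(p: Point3, q: Point3) -> Edge:
    return Edge(lambda t: blend(p, q, q, t, 0.0))


def flat_face(p: Point3, q: Point3, r: Point3) -> Face:
    return Face(lambda u, v: blend(p, q, r, u, v))


def spherical_face(p: Point3, q: Point3, r: Point3) -> Face:
    """Non-planar triangle: the flat patch pushed out onto the unit sphere."""
    def patch(u, v):
        x, y, z = blend(p, q, r, u, v)
        n = math.sqrt(x * x + y * y + z * z)
        return (x / n, y / n, z / n)
    return Face(patch)


# A surface is a sum of simplices; here it is simply a list of faces.
Surface = List[Face]
```

The `spherical_face` example shows why the triangles of the triangulation need not be planar: the parameter domain is still a flat triangle, but its image in R^3 is a curved patch.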

GeoMeter  has  two  basic  parts:  a  geometric  section,  and  an  analytic  section.  The  three 
basic  entities  which  the  geometric  section  uses  to  represent  sets  of  points  are  the  vertex, 
the  edge,  and  the  face.  These  are  then  composed  to  represent  solid  objects.  Topological 
structures  are  then  used  to  define  the  connectivity  between  the  sets  of  the  model,  and  solid 
objects  are  built  hierarchically  starting  with  vertices,  then  edges,  then  faces. 

The  analytic  section  of  GeoMeter  is  devoted  to  the  manipulation  of  polynomials  and 
transcendental  functions.  This  is  of  interest  because  these  functions  permit  the  exact 
description  of  curved  surfaces,  and  also  because  such  manipulations  provide  a  mechanism 
for  performing  algebraic  deduction,  which  is  useful  in  reasoning  about  geometric  relations. 

We  have  surveyed  a  portion  of  the  UMass  campus  and  have  used  GeoMeter  to  construct 
a  3-D  model,  including  buildings,  sidewalks,  lampposts,  telephone  poles,  etc.  This  model 
has  been  annotated  with  properties  of  objects  and  surfaces  which  are  useful  to  the  planning 
and  vision  routines  used  by  our  mobile  robot  HARV.  Although  this  cannot  include  every 
visible  entity  (e.g.,  dirt  patches  within  grassy  areas),  most  of  the  significant  stationary  ob¬ 
jects  in  the  environment  have  been  represented  in  the  model.  Finally,  the  entire  model  has 
been  placed  in  a  space-organizing  data  structure,  which  divides  3-D  space  into  “locales,” 
or  space  packets,  that  are  used  for  planning  and  for  locating  the  robot.  In  Figure  11,  we 
show  how  GeoMeter  models  the  area  around  our  building. 
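
One simple way to realize a space-organizing structure of this kind is a uniform grid in which each cell acts as a locale. The report does not say how the locales are actually defined, so the uniform grid, the cell size, and the `LocaleIndex` class below are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


class LocaleIndex:
    """Uniform grid over 3-D space; each cell plays the role of one locale."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def locale_of(self, point):
        """Integer grid coordinates of the cell containing a 3-D point."""
        return tuple(math.floor(c / self.cell_size) for c in point)

    def add(self, name, point):
        """Register a named model object at a 3-D position."""
        self.cells[self.locale_of(point)].append(name)

    def nearby(self, point, reach=1):
        """Names of objects in the locale of `point` and its neighbours."""
        cx, cy, cz = self.locale_of(point)
        found = []
        for dx, dy, dz in product(range(-reach, reach + 1), repeat=3):
            found.extend(self.cells.get((cx + dx, cy + dy, cz + dz), []))
        return sorted(found)
```

A planner or localizer can then ask only for the landmarks near the robot's estimated position, for example `index.nearby(robot_position)`, instead of searching the entire model.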

3.4  2-D Model Matching

An important problem in model-driven 3-D interpretation is how to use approximate
knowledge of the location and orientation of the sensor, models of objects in the environment,
and the results of low-level vision to determine the image-to-model correspondence.
The approach we have taken is to separate 2-D model-to-image matching from the determination
of the 3-D pose parameters (see Section 3.5). We believe this approach will be
more robust.

Beveridge, et al. [30,31] assume that a 2-D model has been supplied with rough
constraints on its image position (e.g., via an approximate 3-D location in a modelled
environment). This substantially reduces the search space of possible model-image line
correspondences. The goal here is to determine correspondences between model and data
lines such that an optimized spatial fit will produce the lowest match error. The search
must be carried out across the space of possible line correspondences. This involves dealing
with the complexities of grouping fragmented data and missing or erroneous lines. The
rotation and translation of the model that minimize the error in spatial fit for a given set
of line correspondences are computed via a closed-form solution.

Figure 11: The GeoMeter Model of the Area Around the Graduate
Research Center at UMass.

11a. GeoMeter model of the area around the Graduate Research Center. 11b. A more
detailed GeoMeter model (with hidden lines removed) of the same area shown in 11a. Note
that additional landmarks, such as telephone poles, have been added.

In  more  detail,  the  basic  steps  of  the  model-matching  algorithm  are: 

1.  Determine the search space of correspondences. Lacking constraints on model position,
every data line segment may correspond to any model line segment. If
constraints are available, only associations of model and data lines satisfying these
constraints need be considered.

2.  Determine promising model positions if the search space is large. Use these positions
to determine constrained search subspaces made up only of correspondences
consistent  with  the  estimated  position.  A  promising  model  position  may  be  found 
either  through  a  generalized  Hough  transform  or  by  identifying  prominent  features. 
The  generalized  Hough  technique  involves  an  analysis  of  the  space  of  possible  two- 
dimensional  spatial  transforms  necessary  to  bring  the  model  and  the  data  into  align¬ 
ment.  The  identification  of  a  prominent  feature  may  involve  finding  a  distinctive 
part  of  a  model  such  as  a  corner,  then  using  that  to  position  the  model  as  a  whole. 

3.  For  each  of  the  constrained  search  spaces  (sets  of  possible  model-data  correspon¬ 
dences)  obtained  above,  use  iterative  refinement  to  determine  a  best  match.  After 
each  iteration,  perturb  the  correspondence,  adding  or  deleting  one  or  several  data 
lines, and then determine the new best-fit model position and related match error. If
the  match  error  is  thereby  reduced,  adopt  the  improved  match;  stop  when  the  match 
can  no  longer  be  improved.  The  best  of  the  resulting  matches  is  taken  as  the  final 
match. 
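
The iterative refinement in step 3 can be sketched as follows. To keep the example short it matches points instead of line segments (a point could stand for a segment midpoint), it uses the standard closed-form least-squares solution for a 2-D rotation and translation between paired points, and it defines the match error as the squared fit residual plus a fixed penalty for each unmatched model point. These simplifications, the penalty value, and all names are our assumptions and are not taken from [30,31].

```python
import math


def fit_rigid_2d(pairs):
    """Closed-form least-squares rotation and translation taking model points to data points."""
    n = len(pairs)
    mcx = sum(m[0] for m, _ in pairs) / n
    mcy = sum(m[1] for m, _ in pairs) / n
    dcx = sum(d[0] for _, d in pairs) / n
    dcy = sum(d[1] for _, d in pairs) / n
    s_dot = s_cross = 0.0
    for (mx, my), (dx, dy) in pairs:
        ax, ay, bx, by = mx - mcx, my - mcy, dx - dcx, dy - dcy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dcx - (c * mcx - s * mcy), dcy - (s * mcx + c * mcy)


def match_error(model, data, corr, miss_penalty=1.0):
    """Squared residual of the best fit plus a penalty per unmatched model point."""
    if len(corr) < 2:
        return math.inf
    pairs = [(model[i], data[j]) for i, j in corr.items()]
    theta, tx, ty = fit_rigid_2d(pairs)
    c, s = math.cos(theta), math.sin(theta)
    residual = sum((c * mx - s * my + tx - dx) ** 2 + (s * mx + c * my + ty - dy) ** 2
                   for (mx, my), (dx, dy) in pairs)
    return residual + miss_penalty * (len(model) - len(corr))


def refine(model, data, corr, candidates):
    """Perturb one correspondence at a time; keep a change only if the error drops."""
    best, best_err = dict(corr), match_error(model, data, corr)
    improved = True
    while improved:
        improved = False
        for i, options in candidates.items():
            for j in list(options) + [None]:
                if j is not None and any(k != i and v == j for k, v in best.items()):
                    continue  # data point already claimed by another model point
                trial = dict(best)
                if j is None:
                    trial.pop(i, None)   # delete the correspondence for model point i
                else:
                    trial[i] = j         # add or change the correspondence
                err = match_error(model, data, trial)
                if err < best_err - 1e-12:
                    best, best_err, improved = trial, err, True
    return best, best_err
```

Here `corr` maps model indices to data indices and `candidates` lists, for each model index, the data indices allowed by the constrained search space. Like any local search, the refinement can stop at a local minimum, which is one reason to start it from several promising model positions and keep the best result.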

This  algorithm  has  achieved  interesting  results  when  used  on  images  from  our  mobile 
robot  domain.  In  Figure  12,  we  show  a  512  x  512  image  of  the  area  around  our  building, 
taken  from  the  mobile  robot  HARV.  In  Figure  13a,  we  show  six  navigational  landmarks 
obtained  using  GeoMeter.  In  Figure  13b,  we  show  the  result  of  applying  the  2-D  model 
matcher  to  the  image.  We  see  that  the  matcher  has  correctly  found  the  data  segments 
which  match  the  landmark  lines. 

3.5  3-D  Pose  Refinement 

Kumar  [47]  has  developed  an  optimization  technique  for  finding  the  3-D  sensor  pose  given  a 
set  of  correspondences  between  3-D  model  lines  and  2-D  image  lines.  The  3-D  pose  is  given 
by  the  rotation  and  translation  matrices  which  map  the  world  coordinate  system  to  the 
sensor  coordinate  system.  Using  the  output  of  the  system  described  in  the  previous  section, 
these  algorithms  allow  updating  of  the  mobile  robot  position  via  landmark  recognition. 

Previous researchers, e.g., Liu, et al. [50], have decomposed this problem into two
stages: first solve for the rotation, and then solve for the translation. The problem with
this approach is that the rotation and translation constraints, when used separately, are
very weak, so that even small errors in the rotation stage can be amplified
into large errors in the translation stage. In Kumar’s work, rotation and translation are solved
for simultaneously using an algorithm called “R-and-T.” The resulting constraints for this
approach are much tighter, and the solution is hence much more robust to noise than previous
approaches.

Figure 12: A 512 x 512 Image Taken With Our Mobile Robot HARV.

The  technique  used  to  solve  for  the  optimal  rotation  and  translation  is  adapted  from 
the  work  of  Horn  [45]  on  the  problem  of  relative  orientation.  Kumar  minimizes  the  ob¬ 
jective  function,  which  measures  the  error  between  the  data  and  a  presumed  rotation  and 
translation,  by  first  estimating  the  rotation  and  translation.  He  then  linearizes  the  error 
term  about  this  estimate  and  makes  iterative  adjustments  to  the  rotation  and  translation 
that  reduce  this  error.  The  iterations  are  continued  until  the  algorithm  converges  to  a  min¬ 
imum.  This  nonlinear  least-squares  optimization  technique  has  much  better  convergence 
properties  than  does  Liu,  et  al.’s  solution  method  based  on  Euler  angles.  The  algorithm 
has  been  tested  on  both  synthetic  and  real  images,  with  good  results  (see  Table  7). 
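
The estimate, linearize, and adjust cycle described above has the general shape of a Gauss-Newton iteration. The notation below is ours and is not taken from [47]: p stacks the rotation and translation parameters, e_i(p) is the residual of the i-th model-to-image line correspondence, and J_i is its row of partial derivatives.

```latex
E(p) = \sum_{i=1}^{N} e_i(p)^2, \qquad p = (\omega, T)

e_i(p_k + \delta) \approx e_i(p_k) + J_i(p_k)\,\delta,
\qquad J_i = \frac{\partial e_i}{\partial p}

\delta_k = -\Bigl(\sum_{i=1}^{N} J_i^{\top} J_i\Bigr)^{-1}
            \sum_{i=1}^{N} J_i^{\top} e_i(p_k),
\qquad p_{k+1} = p_k + \delta_k
```

Because the update is solved for all components of p at once, an error in the rotation estimate is corrected together with the translation instead of being carried into a separate translation stage.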

For  practical  applications,  the  issue  of  computational  speed  is  critical.  The  acquisition 
of  parallel  hardware,  a  Sequent  multiprocessor,  will  decrease  the  processing  time  required 
for  both  vision  and  motor  tasks  and  is  expected  to  enhance  the  real-time  capabilities  of  the 
mobile  robot  project.  We  are  in  the  process  of  porting  our  algorithms  for  robot  navigation 
onto  the  Sequent,  and  will  be  doing  timing  experiments.  An  additional  piece  of  hardware 
is  the  UMass  Image  Understanding  Architecture  (IUA)  currently  being  developed  under 
another DARPA-sponsored contract [63]. The IUA is a three-level board (64 x 64) which
has been designed to deal with the different levels of computation that one typically
finds  in  vision  tasks,  and  should  be  able  to  operate  at  speeds  that  allow  real-time  vehicle 
control.  When  it  is  complete,  we  believe  that  real-time  processing  for  most  of  the  vision 
and  robotics  navigation  algorithms  will  be  feasible. 
            NOISE             ROTATION ERROR           TRANSLATION ERROR
No.        θ       ρ         δω      δω      δω_x      ΔT      ΔT      ΔT
Lines     deg.   pixels     deg.    deg.    deg.      feet    feet    feet

Correct                     0.00    0.00    0.00      0.00    0.00    0.00
 5        1.0     1.0       0.24    0.15    0.04      0.21    2.03    1.16
 5        5.0     5.0       1.20    0.79    0.19      1.08   10.14    6.20
 5        1.0     5.0       0.24    0.16    0.04      0.21    2.04    1.18
 5        5.0     1.0       1.19    0.78    0.19      1.08   10.14    6.20
10        1.0     1.0       0.21    0.08    0.05      0.02    1.73    0.08
10        5.0     5.0       0.72    0.27    0.31      0.18    6.33    0.48
14        1.0     1.0       0.07    0.06    0.08      0.03    0.77    0.02
14        5.0     5.0       0.34    0.30    0.39      0.17    3.80    0.12
30        1.0     1.0       0.03    0.05    0.06      0.06    0.48    0.06
30        5.0     5.0       0.16    0.24    0.31      0.32    2.39    0.32

Table 7: Average Absolute Error of Translation and Rotation for
the R-and-T Algorithm.

The average for each experiment is taken over 100 samples of uniform noise.

4  Conclusions

This  section  presents  the  conclusions  drawn  from  the  research  performed  under  this  con¬ 
tract. 

•  Most motion is not translational. There is no such thing as uniform translational
motion, except in very strictly controlled situations. In the absence of a
gyro-stabilized sensor, there are usually rotational motion components in excess of
1° for real image sequences. Algorithms which assume uniform translational motion
in order to calculate quantitative information can be expected to perform poorly
in such realistic situations. They will therefore be of little use for tasks which require
accurate quantitative information, such as computing structure from motion,
unless the objects are quite close to the sensor. They may be of some use for more
qualitative tasks such as avoidance of distant objects, or for navigation.

•  General Motion is Necessary. In practical situations, general motion algorithms
will  be  necessary  for  any  quantitative  task.  Our  combination  of  the  Anandan  and 
Adiv  algorithms  to  obtain  a  general  motion  algorithm  shows  promise  and  seems  to 
be  able  to  find  environmental  depth  with  an  error  of  less  than  about  10%. 

•  Stereoscopic  Motion  May  Be  Useful.  This  is  an  alternative  to  the  general  mo¬ 
tion  algorithms.  Although  we  do  not  yet  have  much  experimental  data  on  algorithms 
which  combine  stereo  and  motion,  we  think  the  initial  results  are  promising. 

•  Longer  Image  Sequences  Should  Improve  Robustness.  One  way  of  achieving 
good  performance  for  monocular  image  sequences  is  to  use  longer  image  sequences. 
The  additional  information  and  constraints  provided  by  such  sequences  should  lead 
to  more  robust  results. 

•  Algorithms Must Be Evaluated Scientifically. Accurate ground truth is needed
to have a quantitative metric for the evaluation of an algorithm’s performance. The
scientific evaluation of such an algorithm cannot be performed if the correct
answer is not known.

•  Landmarks are Useful. The use of landmarks in model-based vision appears to
be feasible. This means that models of the environment are needed. The acquisition
of such models is a non-trivial problem in itself.

•  The Decomposition of 2-D and 3-D Processing is Useful for Navigation.
The process of correspondence between image data and model data is complicated
by sensory data that are noisy (e.g., skewed and translated lines), fragmented, or
incomplete. The recovery of 3-D pose can be simplified if the problem is
decomposed into 2-D optimization of line correspondences during model-matching,
followed by 3-D optimization of the robot’s position and orientation.
5  Recommendations


In this section, we detail our recommendations for the direction in which the research
supported by the present contract should head. First we outline the directions for motion
research, and then we present recommendations for future research in mobile robotics.

5.1  Directions  for  Motion  Research 

•  Motion  algorithms  must  be  precise.  Motion  algorithms  that  derive  depth  from 
an  analysis  of  sensor  motion  must  be  capable  of  recovering  the  parameters  of  general 
motion  with  rotational  accuracies  of  much  less  than  one  degree.  If  the  algorithm 
cannot  perform  to  this  level,  it  will  be  difficult  to  recover  the  environmental  depth  of 
surfaces  that  are  at  medium  distances  from  the  sensor  (for  example,  40  feet  or  more 
from  the  sensor,  when  the  sensor  moves  2  feet  between  frames).  The  general  motion 
algorithm  of  Adiv  has  shown  significant  promise  in  recovering  the  depths  of  outdoor 
objects  with  less  than  10%  error.  The  robustness  of  such  general  motion  algorithms 
must  be  carefully  evaluated  on  many  sequences  of  controlled  image  data. 

•  Motion  algorithms  must  be  compared  with  ground  truth.  The  motion  data 
set  obtained  by  us  at  Martin  Marietta  has  known  ground  truth  for  both  environmen¬ 
tal  depth  and  sensor  motion  parameters.  It  can  therefore  serve  as  the  touchstone  for 
the  scientific  evaluation  of  the  accuracy  of  motion  algorithms.  This  data  set  is  being 
made  widely  available;  we  intend  to  utilize  it  extensively. 

•  Motion  and  stereo  should  be  used  together.  Efforts  to  combine  motion  and 
stereo  should  be  extended  from  the  analysis  of  synthetic  laboratory  data  and  applied 
to  real  scenes.  Such  algorithms  also  promise  the  possibility  of  dealing  with  general 
motion.  The  goal  here  should  be  an  algorithm  that  initially  recovers  a  coarse  ap¬ 
proximation  to  surfaces  over  the  first  few  frames  of  the  image  sequence,  and  then 
continuously  refines  the  surface  to  form  a  better  approximation.  The  detection  of 
occlusion  boundaries  and  depth  discontinuities  will  be  critical  to  the  success  of  this 
effort.  The  performance  of  such  algorithms  should  be  compared  with  the  perfor¬ 
mance  of  general  motion  algorithms  (such  as  that  of  Adiv)  for  the  recovery  of  sensor 
motion  and  environmental  depth. 

•  Trajectories  should  be  used.  Long  temporal  sequences  should  be  useful  for  any 
motion algorithm. The development of token-based tracking algorithms (such
as  the  line-tracking  algorithm  of  Williams  and  Hanson)  is  needed  to  extract  the 
trajectories  of  tokens  across  sequences.  As  two  tokens  of  the  same  type  cross  each 
other,  as  frequently  occurs,  the  match  becomes  ambiguous  and  the  tracking  sequence 
is  disrupted.  If  the  image  trajectories  were  fit  via  smooth  curves,  they  could  be 
unambiguously  matched,  and  in  fact  their  crossing  and  occlusion  could  be  predicted. 
Obtaining the trajectories of tokens can also provide critical information for the
organization  of  moving  objects  and  the  recovery  of  their  natural  coordinate  systems. 

•  Approaches  to  top-down  surface  extraction  for  both  static  and  moving 
objects  should  be  investigated.  The  goal  here  would  be  to  make  use  of  a  static 
3-D  representation  of  the  environment,  and  the  approximate  location  of  the  vehicle, 
which  is  often  available.  In  addition,  the  system  could  be  provided  with  a  model  of 
objects  that  are  capable  of  locomotion,  such  as  people,  cars,  or  bicycles.  Thus,  direct 
extraction  of  the  motion  parameters  of  a  surface  may  be  possible  by  using  specific 
or  general  surface  models.  Furthermore,  the  extraction  and  refinement  of  the  depth 
of  the  surfaces  would  be  enhanced  by  jointly  processing  the  image  motions  of  a  set 
of  points,  or  an  area,  with  the  knowledge  of  the  possible  or  probable  surface  models 
that  can  explain  the  image  data. 
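
The precision requirement in the first recommendation above can be illustrated with a small calculation using the figures quoted there (a surface 40 feet away, 2 feet of sensor motion between frames). The sketch assumes a sensor advancing along its optical axis and a point 20 degrees off that axis; the bearing, the planar geometry, and the function names are our own choices for illustration.

```python
import math


def bearing_after_advance(range_ft, bearing_deg, forward_ft):
    """Bearing (degrees) of a static point after the sensor advances along its optical axis."""
    phi = math.radians(bearing_deg)
    lateral = range_ft * math.sin(phi)       # offset perpendicular to the axis
    along_axis = range_ft * math.cos(phi)    # depth along the axis before the advance
    return math.degrees(math.atan2(lateral, along_axis - forward_ft))


def depth_from_bearings(before_deg, after_deg, forward_ft):
    """Depth along the axis (before the advance) triangulated from two bearings.

    Follows from lateral = z * tan(before) = (z - d) * tan(after).
    """
    t0 = math.tan(math.radians(before_deg))
    t1 = math.tan(math.radians(after_deg))
    return forward_ft * t1 / (t1 - t0)


if __name__ == "__main__":
    before = 20.0                                      # hypothetical bearing, degrees
    after = bearing_after_advance(40.0, before, 2.0)   # 40 ft range, 2 ft advance
    print(f"bearing change from translation:  {after - before:.2f} deg")
    print(f"depth, rotation known exactly:    {depth_from_bearings(before, after, 2.0):.1f} ft")
    print(f"depth, 1 deg rotation unmodelled: {depth_from_bearings(before, after + 1.0, 2.0):.1f} ft")
```

In this configuration the translation changes the bearing by only about one degree, so an unmodelled rotation of one degree is as large as the signal itself, and the triangulated depth falls to roughly half its true value.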

5.2  Directions  for  Mobile  Robot  Research 

•  Evaluate  the  efficacy  of  using  accurate  3-D  knowledge  of  the  environment. 
The  3-D  representations  and  knowledge  base  must  serve  as  a  map  for  path  planning 
and  navigation,  as  well  as  for  maintaining  descriptions  of  objects  for  goal  and  land¬ 
mark  recognition.  We  intend  to  use  this  representation  (using  GeoMeter)  to  capture 
two  local  environments,  the  interior  hallways  of  our  building,  and  the  outside  of  our 
building,  for  experiments  in  vehicle  navigation  and  to  test  a  variety  of  navigation 
tasks. 

•  Use  a  wider  range  of  knowledge  about  the  environment,  e.g.,  color  and 
texture.  The  model  of  the  environment  can  be  enriched  with  information  that  rep¬ 
resents  more  qualitative  spatial  constraints  than  those  obtained  using  a  3-D  modeller 
(such  as  GeoMeter).  This  information  can  be  captured  in  a  manner  similar  to  the 
road  scene  models  in  the  current  knowledge  base  of  the  VISIONS  system.  By  using 
this  methodology,  the  areas  of  the  image  which  cannot  be  conveniently  represented 
as  wire-frame  models,  such  as  vegetation  or  distant  mountains,  can  all  be  added  to 
the  tight  geometric  models  to  provide  additional  knowledge  for  object  recognition 
and  navigation. 

•  Further  develop  landmark-based  navigation  strategies.  The  ability  to  relate 
image  events  to  stored  models  of  objects  and  landmarks  will  be  crucial  to  utilizing 
the  knowledge  of  the  environment  that  is  stored  in  a  map.  If  specific  landmarks  can 
be  recognized,  then  their  location  on  a  map  can  be  used  to  determine  the  location  of 
the  vehicle  in  the  environment,  or  at  least  to  reduce  the  uncertainty  in  the  vehicle’s 
position  and  orientation.  In  addition,  this  will  be  necessary  to  achieve  goals,  since 
the  specification  of  goals  will  often  involve  relationships  to  objects.  The  accuracy  of 
these landmark-recognition algorithms across a variety of landmark/object models
and  at  a  range  of  distances  from  the  sensor  should  be  evaluated.  We  expect  to 
demonstrate that a 3-D model and model-based vision algorithms can be used to
effectively  navigate  from  an  approximately  known  starting  location  to  another  desired 
location. 

•  Supplement model-based algorithms with stereo and motion algorithms.
Model-based  algorithms  will  not  work  well  if  an  unmodelled  object  is  encountered  by 
the  robot.  Motion  and  stereo  algorithms  should  therefore  be  used  to  supplement  the 
static  recognition  of  landmarks  by  providing  the  depth  of  points,  lines,  and  surfaces 
as  a  function  of  bottom-up  processing  of  an  image  sequence.  This  information  would 
then  be  useful  for  such  tasks  as  obstacle  avoidance  and  the  automatic  acquisition  of 
3-D  models. 

•  Use learning to automatically acquire object and scene models. The information
required for object recognition strategies can be time-consuming to construct
entirely by hand. It is possible that a training set of interpreted scenes can
be  used  to  automatically  acquire  object  schema  knowledge.  Some  of  the  attributes  of 
object  classes  such  as  color,  texture,  size,  shape,  or  location  relative  to  other  objects 
may  be  automatically  extracted  via  the  use  of  multiple  examples  of  instances  in  a 
training  set.  Geometric  knowledge  can  also  be  acquired  during  exploration  of  an 
environment  via  motion  and  stereo  processing.  Thus  object  and  scene  models  can 
be  continuously  acquired  and  refined  during  or  after  each  navigational  experience. 